
Docket No.: PF-0162-3 DIV 
Response Under 37 C.F.R. 1.116 - Expedited Procedure 

Examining Group 1645 



Certificate of Mailing 

I hereby certify that this correspondence is being deposited with the United States Postal Service as first class mail in an envelope
addressed to: Mail Stop: Appeal Brief-Patents, Commissioner for Patents, P.O. Box 1450, Alexandria, VA 22313-1450 on

Printed: Lyza Finuliar



IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 

In re Application of: Goli et al. 



Title: A NOVEL GLUTATHIONE S-TRANSFERASE 

Serial No.: 09/784,739 Filing Date: February 14, 2001

Examiner: Hines, J.A. Group Art Unit: 1645

Mail Stop: Appeal Brief-Patents
Commissioner for Patents
P.O. Box 1450 
Alexandria, VA 22313-1450 

FEE TRANSMITTAL SHEET 

Sir: 

Transmitted herewith are the following for the above-identified application: 

1. Return Receipt Postcard; 

2. Brief on Appeal, including Appendix (13 pp., in triplicate); and 

3. One (1) Reference (6 pp., in triplicate). 



The fee has been calculated as shown below. 



Claims    Claims After   Claims Previously   Present   Other Than Small Entity   Additional
          Amendment      Paid For            Extra     Rate           Fee        Fee(s)

Total     19             20                            x $18.00       0          $ 0

Indep.    4              4                             x $86.00       0          $ 0

First Presentation of Multiple Dependent Claims        + $290.00      0          $ 0

Total Fee:                                                                       $ 0

X Fee for filing a Brief in support of an Appeal under 37 CFR 1.17(c): $330.00

X Please charge Deposit Account No. 09-0108 in the amount of: $330.00

The Commissioner is hereby authorized to charge any additional fees required under 37 CFR 1.16 and 1.17,
or credit overpayment to Deposit Account No. 09-0108. A duplicate copy of this sheet is enclosed.



Date: 



Customer No.: 27904 
3160 Porter Drive 
Palo Alto, California 94304 
Phone: (650) 855-0555 
Fax: (650) 845-4166 



Respectfully submitted, 
INCYTE CORPORATION 



Susan K. Sather 
Reg. No. 44,316 

Direct Dial Telephone: (650) 845-4646 



117769 



09/784,739 



Docket No.: PF-0162-3 DIV

Response Under 37 C.F.R. 1.116 - Expedited Procedure

Examining Group 1645

Certificate of Mailing

I hereby certify that this correspondence is being deposited with the United States Postal Service as first class mail in an envelope addressed to:
Mail Stop: Appeal Brief-Patents, Commissioner for Patents, P.O. Box 1450, Alexandria, VA 22313-1450 on December 16, 2003.

By:          Printed: Lyza Finuliar

IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 
BEFORE THE BOARD OF PATENT APPEALS AND INTERFERENCES 

In re Application of: Goli et al. 

Title: A NOVEL GLUTATHIONE S-TRANSFERASE 

Serial No.: 09/784,739 Filing Date: February 14, 2001 

Examiner: Hines, J. A. Group Art Unit: 1645 

Mail Stop Appeal Brief-Patents
Commissioner for Patents 
P.O. Box 1450 
Alexandria, VA 22313-1450 

BRIEF ON APPEAL 



Sir: 

Further to the Notice of Appeal filed October 22, 2003, and received by the USPTO on 
October 27, 2003, herewith are three copies of Appellants' Brief on Appeal. Authorized fees 
include the $ 320.00 fee for the filing of this Brief. 

This is an appeal from the decision of the Examiner finally rejecting claims 2-4 and 8 of 
the above-identified application. 

(1) REAL PARTY IN INTEREST
The above-identified application is assigned of record to Incyte Pharmaceuticals, Inc., 
(now Incyte Corporation, formerly known as Incyte Genomics, Inc.) (Reel 8620, Frame 
0308) which is the real party in interest herein. 




Doc No. 117560 1 09/784,739






(2) RELATED APPEALS AND INTERFERENCES 
Appellants, their legal representative and the assignee are not aware of any related 
appeals or interferences which will directly affect or be directly affected by or have a bearing on 
the Board's decision in the instant appeal.



(3) STATUS OF THE CLAIMS
Claims rejected: Claims 2-4 and 8 [1]
Claims allowed: (none)
Claims canceled: Claims 9 and 21-24
Claims withdrawn: Claims 1, 5-7, 10-20 [2]
Claims on Appeal: Claims 2-4 and 8 (A copy of the claims on appeal, as amended, can
be found in the attached Appendix).



(4) STATUS OF AMENDMENTS AFTER FINAL 
There were no amendments submitted after Final Rejection. 



(5) SUMMARY OF THE INVENTION 
Appellants' invention is directed, inter alia, to polynucleotides encoding polypeptides
having homology to Alpha class glutathione s-transferases from normal human liver, pGTH2
(GI 825605), human hepatoma, A1-1 (GI 259141), and mouse lung, GST 5.7 (GI 193710).
These polynucleotides have a variety of utilities, in particular in expression profiling, and in
particular for diagnosis of conditions or diseases characterized by expression of SEQ ID NO:1
(HGST), for toxicology testing, and for drug discovery (see the Specification at, e.g., page 34,
line 18 through page 37, line 11). As described in the Specification (page 11, line 28 through
page 12, line 22):

Nucleic acids encoding the human HGST of the present invention were
first identified in Incyte Clone 1553079 from the bladder tumor cDNA library
(BLADTUT04) through a computer-generated search for amino acid sequence
alignments. A consensus sequence, SEQ ID NO:2, was derived from the
following overlapping and/or extended nucleic acid sequences: Incyte Clones
1553079/BLADTUT04, 1328546/PANCNOT07, 1422059/KIDNNOT09, and
2188683/PROSNOT26.

In one embodiment, the invention encompasses the novel human
glutathione s-transferase, a polypeptide comprising the amino acid sequence of
SEQ ID NO:1, as shown in Figures 1A, 1B, and 1C. HGST is 222 amino acids in
length and has chemical and structural homology with two human Alpha GSTs,
pGTH2 (GI 825605; SEQ ID NO:3), and A1-1 (GI 259141), and a mouse Alpha
GSH, GST 5.7 (GI 193710). In particular, HGST shares 57% overall identity
with each of the two human GSTs and 59% identity with the mouse GST. In
addition, various amino acid residues found to be essential for the catalytic
activity and substrate binding of GSTs are conserved in HGST and in the other
three GST molecules. These residues are: Y9, R13, R20, E32, Q67, T68, R69,
E97, D101, E104, and R131. Only residues E97 and E104 are not found in the
mouse GST. E32 and E97 form salt bridges with R20 and R69, respectively, and
these salt bridges or residues are thought to be important in structural stability of
the GST molecule and may be important for catalysis. Y9 is essential for
catalysis by facilitating ionization of GSH. Residues Q67, T68, D101, E104, and
R131 are important for the binding of GSH. As illustrated by Figures 3, 4, 5, and
6, HGST and the three Alpha GSTs have rather similar hydrophobicity plots.
Figures 7, 8, and 9 show the isoelectric point analyses for HGST, pGTH2, and
A1-1. The pI values of 8.8, 9.0, and 9.3, respectively, fall within the range
characteristic of Alpha GSTs. In addition to bladder tumor, partial transcripts of
the cDNA encoding HGST are found in fetal tissues (kidney and pancreas) and in
prostate tissue adjacent to prostate cancer.

[1] The Final Office Action, on page 3, states that "[t]he written description rejection of claims 2-4 and 8-9
under 35 U.S.C. 112, first paragraph, as containing subject matter which was not described in the specification in
such a way as to reasonably convey to one skilled in the relevant art that the inventor(s), at the time the application
was filed, had possession of the claimed invention." Applicants note that Claim 9 was canceled in the Response
filed March 13, 2003. Applicants assume that the Examiner inadvertently listed Claim 9 as being rejected on the
basis of written description.

[2] The Final Office Action, page 1 (Office Action Summary), states that "claim(s) 1, 5-7 and 10-24 is/are
withdrawn from consideration." Applicants note that Claims 21-24 were canceled on the transmittal filed February
14, 2001. Applicants assume that the Examiner inadvertently listed Claims 21-24 as being withdrawn claims.



(6) ISSUES 

1. Whether the polynucleotides of Claims 2-4 and 8 meet the written description 
requirement of 35 U.S.C. § 112, first paragraph. 

2. Whether the polynucleotides of Claim 8 are covered by the judicially created 
doctrine of obviousness-type double patenting over Claims 3-4 of U.S. Patent No. 5,817,497. 



(7) GROUPING OF THE CLAIMS 

As to Issue 1 

This issue pertains to Claims 2-4 and 8. 
As to Issue 2 

This issue pertains to Claim 8. 



(8) APPELLANTS' ARGUMENTS
Issue One: Written Description Rejection 

Claims 2-4 and 8 have been rejected under the first paragraph of 35 U.S.C. § 112 for
alleged lack of an adequate written description. The Examiner alleges that the polynucleotides
of Claim 2 encoding polypeptide variants, the polynucleotide variants of Claim 8, and the
polynucleotides of Claim 8 complementary to or ribonucleotide equivalents of SEQ ID NO:2 and
SEQ ID NO:2 variants are not adequately described.

The Examiner alleges that "there is no disclosure of naturally occurring variants having at
least 90% sequence identity to the sequence of SEQ ID NO:1 or 2 over the entire length of SEQ
ID NO:1 or 2 or complementary sequences within the family of enzymes." (Final Office Action,
pages 7-8.)

The Examiner ignores the claim limitations of "having at least 90% sequence identity to
the sequence of SEQ ID NO:1 over the entire length of SEQ ID NO:1" and "having at least 90%
sequence identity to the sequence of SEQ ID NO:2, over the entire length of SEQ ID NO:2."
The Examiner also ignores the limitation that the claimed polynucleotides encode a
naturally-occurring amino acid sequence, comprise a naturally-occurring polynucleotide sequence, are
complementary to a naturally-occurring polynucleotide sequence, or are ribonucleotide
equivalents thereof.

The requirements necessary to fulfill the written description requirement of 35 U.S.C.
§ 112, first paragraph, are well established by case law.

... the applicant must also convey with reasonable clarity to those skilled
in the art that, as of the filing date sought, he or she was in possession of the
invention. The invention is, for purposes of the "written description" inquiry,
whatever is now claimed. Vas-Cath, Inc. v. Mahurkar, 19 USPQ2d 1111, 1117
(Fed. Cir. 1991).

Attention is also drawn to the Patent and Trademark Office's own "Guidelines for
Examination of Patent Applications Under the 35 U.S.C. Sec. 112, para. 1", published January 5,
2001, which provide that:

An applicant may also show that an invention is complete by disclosure of
sufficiently detailed, relevant identifying characteristics which provide evidence
that applicant was in possession of the claimed invention, i.e., complete or partial
structure, other physical and/or chemical properties, functional characteristics
when coupled with a known or disclosed correlation between function and
structure, or some combination of such characteristics. What is conventional or
well known to one of ordinary skill in the art need not be disclosed in detail. If a
skilled artisan would have understood the inventor to be in possession of the
claimed invention at the time of filing, even if every nuance of the claims is not
explicitly described in the specification, then the adequate description requirement
is met. (citations omitted.)

Thus, the written description standard is fulfilled by both what is specifically disclosed 
and what is conventional or well known to one skilled in the art. 

SEQ ID NO:1 and SEQ ID NO:2 are specifically disclosed in the application (see, for
example, pages 50-52). The Specification states that HGST can be "obtained from any species,
particularly mammalian, including bovine, ovine, porcine, murine, equine, and preferably
human, from any source whether natural, synthetic, semi-synthetic, or recombinant" (e.g.,
Specification, at page 5, lines 26-29, emphasis added). Variants of SEQ ID NO:1 are described,
for example, at page 6, lines 6-14. In particular, the preferred, more preferred, and most
preferred SEQ ID NO:1 variants ("one having at least 80%, and more preferably 90%, amino
acid sequence similarity to the HGST amino acid sequence (SEQ ID NO:1) . . . one having at
least 95% amino acid sequence similarity to SEQ ID NO:1") are described, for example, at page
12, lines 23-26. Variants of SEQ ID NO:2 are described, for example, at page 12, line 27
through page 13, line 19. Incyte clones in which the nucleic acids encoding the human HGST
were first identified and libraries from which those clones were isolated are described, for
example, at page 11, line 28 through page 12, line 3, and page 39, lines 14-26 of the
Specification. Chemical and structural features of HGST are described, for example, on page 12,
lines 4-20.

Complementary polynucleotide sequences are described, for example, at page 8, lines 5-13,
page 9, lines 11-20, page 13, lines 26-28, page 28, lines 1-9, and page 46, lines 4-16.
Ribonucleotide equivalents are described, e.g., at page 5, lines 11-14, and page 34, lines 18-20.

Given SEQ ID NO:1, one of ordinary skill in the art would recognize a polynucleotide
encoding a naturally-occurring variant of SEQ ID NO:1 having at least 90% sequence identity to
SEQ ID NO:1. Given SEQ ID NO:2, one of ordinary skill in the art would recognize a
naturally-occurring variant of SEQ ID NO:2 having at least 90% sequence identity to SEQ ID NO:2. The
Specification describes how to use BLAST to determine whether a given sequence falls within
the "having at least 90% sequence identity" scope. (Specification, page 41, line 28 through page
42, line 13.)
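
As a rough illustration of what an "at least 90% sequence identity ... over the entire length" test involves, the following Python sketch scores two sequences that have already been aligned. It is not BLAST and is not the procedure set out in the Specification. The function names, the toy sequences, and the convention of dividing identical positions by the ungapped length of the reference are all assumptions made here for illustration only.

```python
def percent_identity_over_reference(aligned_ref: str, aligned_query: str) -> float:
    """Percent of reference residues that are matched identically by the query.

    Both arguments are rows of one pairwise alignment, with '-' marking gaps.
    The denominator is the full (ungapped) reference length, which is one
    possible reading of "over the entire length" (an assumption of this sketch).
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")
    ref_length = sum(1 for r in aligned_ref if r != "-")
    if ref_length == 0:
        raise ValueError("reference sequence is empty")
    matches = sum(
        1 for r, q in zip(aligned_ref, aligned_query) if r != "-" and r == q
    )
    return 100.0 * matches / ref_length


def meets_identity_threshold(aligned_ref: str, aligned_query: str,
                             threshold: float = 90.0) -> bool:
    """True if the query reaches the threshold identity against the reference."""
    return percent_identity_over_reference(aligned_ref, aligned_query) >= threshold


# Hypothetical 10-residue toy sequences, not SEQ ID NO:1 or SEQ ID NO:2.
reference = "ACDEFGHIKL"
one_change = "ACDEFGHIKV"   # 9 of 10 positions identical
two_changes = "ACDEFGHIVV"  # 8 of 10 positions identical
```

With these toy inputs, `meets_identity_threshold(reference, one_change)` is true and `meets_identity_threshold(reference, two_changes)` is false, which is the kind of bright-line structural test the claim language describes.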






The Examiner asserts that "there is no assay for determining naturally occurring
sequences which support allowing one of skill to screen for such naturally occurring variants
having at least 90% sequence identity to the sequence of SEQ ID NO:1 or 2 over the entire
length of SEQ ID NO:1 or 2." (Final Office Action, page 7.) However, sequence information is
not provided in a vacuum. Identification of the source of the sequence will typically allow one to
determine if it is naturally-occurring. Also, attempted deceit to hide the source will not preclude
infringement.

The Examiner asserts that "Applicants have not taught" "any variant or mutant that has
similar sequence identity yet has a different function" [than SEQ ID NO:1 or SEQ ID NO:2].
(Final Office Action, page 6.) However, the Specification teaches that:

Also included within the scope of the present invention are alleles of the
genes encoding HGST. As used herein, an "allele" or "allelic sequence" is an
alternative form of the gene which may result from at least one mutation in the
nucleic acid sequence. Alleles may result in altered mRNAs or polypeptides
whose structure or function may or may not be altered. (Specification, page
14, lines 14-18, emphasis added.)

There simply is no requirement that the claims recite particular variant polypeptide,
variant polynucleotide, complementary polynucleotide, or ribonucleotide equivalent
polynucleotide sequences because the claims already provide sufficient structural definition of
the claimed subject matter. That is, the polynucleotides encoding polypeptide variants are
defined in terms of SEQ ID NO:1 ("An isolated polynucleotide encoding a polypeptide
comprising an amino acid sequence ... of: a naturally-occurring amino acid sequence having at
least 90% sequence identity to the sequence of SEQ ID NO:1 over the entire length of SEQ ID
NO:1."). The polynucleotide variants, complementary polynucleotides, and ribonucleotide
equivalent polynucleotides are defined in terms of SEQ ID NO:2 ("An isolated polynucleotide
comprising a sequence selected from the group consisting of: a) a polynucleotide sequence of
SEQ ID NO:2, b) a naturally-occurring polynucleotide sequence having at least 90% sequence
identity to the sequence of SEQ ID NO:2, over the entire length of SEQ ID NO:2, c) a
polynucleotide sequence completely complementary to a), d) a polynucleotide sequence
completely complementary to b) and e) a ribonucleotide equivalent of a)-d).").
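
To illustrate why a complementary sequence or a ribonucleotide equivalent is fully determined once a reference sequence is given, the short Python sketch below derives both from a DNA string. It is a minimal illustration only. The function names and the toy input are hypothetical, and reading "ribonucleotide equivalent" as a T-to-U substitution is an assumption of this sketch, not a definition taken from the Specification.

```python
# Watson-Crick pairing for unambiguous DNA bases.
_DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna: str) -> str:
    """Return the completely complementary strand, written 5' to 3'."""
    return "".join(_DNA_COMPLEMENT[base] for base in reversed(dna.upper()))


def ribonucleotide_equivalent(dna: str) -> str:
    """Return the RNA form of a DNA sequence (assumed here to mean T replaced by U)."""
    return dna.upper().replace("T", "U")


# Hypothetical four-base toy input, not SEQ ID NO:2.
toy_sequence = "ATGC"
toy_complement = reverse_complement(toy_sequence)
toy_rna = ribonucleotide_equivalent(toy_sequence)
```

Each output follows mechanically from the input, with no choice left to the person performing the conversion.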

Because the recited polypeptide variants are defined in terms of SEQ ID NO:1, and the
recited polynucleotide variants, complementary polynucleotides, and ribonucleotide equivalent
polynucleotides are defined in terms of SEQ ID NO:2, the precise chemical structure of every
polypeptide variant, every polynucleotide variant, every complementary polynucleotide, and
every ribonucleotide equivalent polynucleotide within the scope of the claims can be discerned.
The Examiner's position is nothing more than a misguided attempt to require Applicants to
unduly limit the scope of their claimed invention. Accordingly, the Specification provides an
adequate written description of the recited polypeptide and polynucleotide sequences.

I. The present claims specifically define the claimed genus through the recitation of
chemical structure

Court cases in which "DNA claims" have been at issue commonly emphasize that the
recitation of structural features or chemical or physical properties is an important factor to
consider in a written description analysis of such claims. For example, in Fiers v. Revel, 25
USPQ2d 1601, 1606 (Fed. Cir. 1993), the court stated that:

If a conception of a DNA requires a precise definition, such as by 
structure, formula, chemical name or physical properties, as we have held, then a 
description also requires that degree of specificity. 

In a number of instances in which claims to DNA have been found invalid, the courts
have noted that the claims attempted to define the claimed DNA in terms of functional
characteristics without any reference to structural features. As set forth by the court in
University of California v. Eli Lilly and Co., 43 USPQ2d 1398, 1406 (Fed. Cir. 1997):

In claims to genetic material, however, a generic statement such as 
"vertebrate insulin cDNA" or "mammalian insulin cDNA," without more, is not 
an adequate written description of the genus because it does not distinguish the 
claimed genus from others, except by function. 

Thus, the mere recitation of functional characteristics of a DNA, without the definition of 
structural features, has been a common basis by which courts have found invalid claims to DNA. 
For example, in Lilly, 43 USPQ2d at 1407, the court found invalid for violation of the written 
description requirement the following claim of U.S. Patent No. 4,652,525: 

1. A recombinant plasmid replicable in procaryotic host containing within 
its nucleotide sequence a subsequence having the structure of the reverse 
transcript of an mRNA of a vertebrate, which mRNA encodes insulin. 

In Fiers, 25 USPQ2d at 1603, the parties were in an interference involving the following
count:



A DNA which consists essentially of a DNA which codes for a human 
fibroblast interferon-beta polypeptide. 

Party Revel in the Fiers case argued that its foreign priority application contained an 
adequate written description of the DNA of the count because that application mentioned a 
potential method for isolating the DNA. The Revel priority application, however, did not have a 
description of any particular DNA structure corresponding to the DNA of the count. The court 
therefore found that the Revel priority application lacked an adequate written description of the 
subject matter of the count. 

Thus, in Lilly and Fiers, nucleic acids were defined on the basis of functional
characteristics and were found not to comply with the written description requirement of 35
U.S.C. § 112; i.e., "an mRNA of a vertebrate, which mRNA encodes insulin" in Lilly, and "DNA
which codes for a human fibroblast interferon-beta polypeptide" in Fiers. In contrast to the
situation in Lilly and Fiers, the claims at issue in the present application define polynucleotides
in terms of chemical structure, rather than on functional characteristics. For example, the
"variant language" of independent Claims 2 and 8 recites chemical structure to define the
claimed genus:

2. An isolated polynucleotide encoding a polypeptide comprising an amino
acid sequence selected from the group consisting of . . .

b) a naturally-occurring amino acid sequence having at least 90%
sequence identity to the sequence of SEQ ID NO:1 over the entire length
of SEQ ID NO:1.

8. An isolated polynucleotide comprising a sequence selected from the group
consisting of . . .

b) a naturally-occurring polynucleotide sequence having at least 90%
sequence identity to the sequence of SEQ ID NO:2, over the entire length
of SEQ ID NO:2 . . .

The Final Office Action asserts that "the nucleic acid itself is required" to describe the
claimed polynucleotides (Final Office Action, page 8). However, Fiers does not state that the
nucleic acid itself is required, but that the claimed sequence must be defined by more than
merely functional properties, such as coding for human erythropoietin. ("Conception of a DNA,
like conception of any chemical substance, requires a definition of that substance by other than
its functional utility" Fiers v. Revel, 25 USPQ2d 1601 at 1604). Fiers makes clear that chemical
and physical properties may be used to define a claimed sequence. ("Conception of a substance
claimed per se without reference to a process requires conception of its structure, name, formula,
or definitive chemical or physical properties" Fiers v. Revel, at 1605).

From the above it should be apparent that the claims of the subject application are
fundamentally different from those found invalid in Lilly and Fiers. The subject matter of the
present claims is defined in terms of the chemical structure of SEQ ID NO:1 and SEQ ID NO:2.
In the present case, there is no reliance merely on a description of functional characteristics of
the polynucleotides recited by the claims. In fact, there is no recitation of functional
characteristics. Moreover, if such functional recitations were included, they would add to the
structural characterization of the recited polynucleotides. The claims of the present application
recite structural features of the polynucleotides, and cases such as Lilly and Fiers stress
that the recitation of structure is an important factor to consider in a written description analysis
of claims of this type. By failing to base its written description inquiry "on whatever is now
claimed," the Final Office Action failed to provide an appropriate analysis of the present claims
and how they differ from those found not to satisfy the written description requirement in Lilly
and Fiers.

II. The present claims do not define a genus which is highly diverse

Furthermore, the claims at issue do not describe a genus which could be characterized as 
highly diverse. Available evidence illustrates that the claimed genus is of narrow scope. 

In support of this assertion, the Examiner's attention is directed to the enclosed reference 
by Brenner et al. ("Assessing sequence comparison methods with reliable structurally identified 
distant evolutionary relationships," Proc. Natl. Acad. Sci. USA (1998) 95:6073-6078; Reference 
No. 1.) Through exhaustive analysis of a data set of proteins with known structural and 
functional relationships and with <90% overall sequence identity, Brenner et al. have determined 
that 30% identity is a reliable threshold for establishing evolutionary homology between two 
sequences aligned over at least 150 residues. (Brenner et al., pages 6073 and 6076.) 
Furthermore, local identity is particularly important in this case for assessing the significance of 
the alignments, as Brenner et al. further report that >40% identity over at least 70 residues is 
reliable in signifying homology between proteins. (Brenner et al., page 6076.) 

The present application is directed, inter alia, to glutathione s-transferase proteins related
to the amino acid sequence of SEQ ID NO:1. In accordance with Brenner et al.,
naturally-occurring molecules may exist which could be characterized as glutathione s-transferase proteins
and which have as little as 40% identity over at least 70 residues to SEQ ID NO:1. The "variant
language" of the present claims recites, for example, polynucleotides encoding a polypeptide
comprising "a naturally-occurring amino acid sequence having at least 90% sequence identity to
the sequence of SEQ ID NO:1 over the entire length of SEQ ID NO:1." This variation is far less
than that of all potential glutathione s-transferase proteins related to SEQ ID NO:1, i.e., those
glutathione s-transferase proteins having as little as 40% identity over at least 70 residues to SEQ
ID NO:1.
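
The difference in breadth between the two thresholds can be made concrete with simple arithmetic. The Python sketch below is illustrative only: it uses the 222-residue length that the quoted Specification passage gives for HGST, and it assumes that percent identity is counted as identical positions divided by the stated length. The function and variable names are hypothetical.

```python
def min_identical_positions(length: int, percent_identity: int) -> int:
    """Smallest whole number of identical positions that reaches the given percent identity."""
    # Integer ceiling of length * percent / 100, avoiding floating-point rounding.
    return (length * percent_identity + 99) // 100


HGST_LENGTH = 222  # residues, per the Specification passage quoted in this brief

# Claimed variants: at least 90% identity over the entire 222-residue length.
claimed_min_identical = min_identical_positions(HGST_LENGTH, 90)
claimed_max_differences = HGST_LENGTH - claimed_min_identical

# Brenner et al. local threshold: 40% identity over a 70-residue alignment.
brenner_min_identical = min_identical_positions(70, 40)
```

Under these assumptions, a claimed variant must share at least 200 of 222 positions with SEQ ID NO:1 (at most 22 differences), whereas the 40% threshold over 70 residues is met with only 28 identical positions.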

And, in any case, the "function" of the claimed polynucleotides and polypeptides
encoded by the claimed polynucleotides is immaterial to their use in toxicology testing, as
described in the Specification and as well known in the art (see the Specification at, e.g., page
34, line 18 through page 37, line 11; Bedilion Declaration filed March 24, 2003, e.g., ¶ 15).

III. The state of the art at the time of the present invention is further advanced than at
the time of the Lilly and Fiers applications

In the Lilly case, claims of U.S. Patent No. 4,652,525 were found invalid for failing to
comply with the written description requirement of 35 U.S.C. § 112. The '525 patent claimed the
benefit of priority of two applications, Application Serial No. 801,343 filed May 27, 1977, and
Application Serial No. 805,023 filed June 9, 1977. In the Fiers case, party Revel claimed the
benefit of priority of an Israeli application filed on November 21, 1979. Thus, the written
description inquiry in those cases was based on the state of the art at essentially the "dark
ages" of recombinant DNA technology.

The present application has a priority date of November 26, 1996. Much has happened in
the development of recombinant DNA technology in the 17 or more years between the filing
of the applications involved in Lilly and Fiers and the filing of the present application. For example, the
technique of polymerase chain reaction (PCR) was invented. Highly efficient cloning and DNA
sequencing technology has been developed. Large databases of protein and nucleotide
sequences have been compiled. Much of the raw material of the human and other genomes has
been sequenced. With these remarkable advances one of skill in the art would recognize that,
given the sequence information of SEQ ID NO:1 and SEQ ID NO:2, and the additional extensive
detail provided by the subject application, the present inventors were in possession of the recited
polypeptide variants, polynucleotide variants, complementary polynucleotides, and
ribonucleotide equivalent polynucleotides at the time of filing of this application.




IV. Summary 

The Final Office Action failed to base its written description inquiry "on whatever is now
claimed." Consequently, the Final Office Action did not provide an appropriate analysis of the
present claims and how they differ from those found not to satisfy the written description
requirement in cases such as Lilly and Fiers. In particular, the claims of the subject application
are fundamentally different from those found invalid in Lilly and Fiers. The subject matter of
the present claims is defined in terms of the chemical structure of SEQ ID NO:1 or SEQ ID
NO:2. The courts have stressed that structural features are important factors to consider in a
written description analysis of claims to nucleic acids and proteins. In addition, the genus of
polynucleotides defined by the present claims is adequately described, as evidenced by Brenner
et al. Furthermore, there have been remarkable advances in the state of the art since the Lilly and
Fiers cases, and these advances were given no consideration whatsoever in the position set forth
by the Final Office Action.

Issue Two: Double Patenting Rejection 

Claim 8 was rejected under the judicially created doctrine of obviousness-type double 
patenting over Claims 3-4 of U.S. Patent No. 5,817,497. While not conceding the propriety of 
the Examiner's position, Appellants are willing to submit a Terminal Disclaimer with respect to 
U.S. Patent No. 5,817,497 in the interest of expediting prosecution of the subject application, 
upon indication that the application is otherwise allowable. Therefore, it is requested that the 
Board indicate that the subject application will be allowable upon submission of such a Terminal 
Disclaimer. 




(9) CONCLUSION 

Appellants request that the rejections of the claims on appeal be reversed for at least the 
above reasons. 

If the USPTO determines that any additional fees are due, the Commissioner is hereby 
authorized to charge Deposit Account No. 09-0108. 
This brief is enclosed in triplicate. 



Respectfully submitted, 
INCYTE CORPORATION 



Date: December 16, 2003

Susan K. Sather 
Reg. No. 44,316 

Direct Dial Telephone: (650) 845-4646 

Customer No.: 27904

3160 Porter Drive 
Palo Alto, California 94304 
Phone: (650) 855-0555 
Fax: (650) 849-8886 



Enclosures: 

1. Brenner et al. ("Assessing sequence comparison methods with reliable structurally
identified distant evolutionary relationships," Proc. Natl. Acad. Sci. USA (1998) 95:6073-6078)




APPENDIX - CLAIMS ON APPEAL 

2. (Previously Presented) An isolated polynucleotide encoding a polypeptide
comprising an amino acid sequence selected from the group consisting of:

a) an amino acid sequence of SEQ ID NO:1, and

b) a naturally-occurring amino acid sequence having at least 90% sequence identity
to the sequence of SEQ ID NO:1 over the entire length of SEQ ID NO:1.

3. (Original) A recombinant polynucleotide comprising a promoter sequence 
operably linked to a polynucleotide of claim 2. 

4. (Original) A cell transformed with a recombinant polynucleotide of claim 3. 

8. (Previously Presented) An isolated polynucleotide comprising a sequence 
selected from the group consisting of: 

a) a polynucleotide sequence of SEQ ID NO:2, 

b) a naturally-occurring polynucleotide sequence having at least 90% sequence
identity to the sequence of SEQ ID NO:2, over the entire length of SEQ ID NO:2,

c) a polynucleotide sequence completely complementary to a), 

d) a polynucleotide sequence completely complementary to b) and 

e) a ribonucleotide equivalent of a)-d). 



Doc No. 117560 



13 



09/784,739 



Docket No.: PF-0162-3 DIV 



Response Under 37 C.F.R, 1,116 - Expedited Procedure 

Examining Group 1645 

Certificate of Mailing 

I hereby ceri^jaj^ftfm corresponds is being deposited with the United States Postal Service as first class mail in an envelope addressed to- 
Mail Stop: Atogaipnd^-pa|^ for Patents, P.O. Box 1450, Alexandria, VA 22313-1450 on December 16. 2001 . 
U ^— V- Printed: Lvza Finuliar 

INTHE UNITED STATES PATENT AND TRADEMARK OFFICE 
BEFORE THE BOARD OF PATENT APPEALS AND INTERFERENCES 

In re Application of: Goli et al. 

Title: A NOVEL GLUTATHIONE S -TRANSFERASE 

Serial No.: 09/784,739 Filing Date: February 14, 2001 

Examiner: Hines, J.A. Group Art Unit: 1645 

Mail Stop Appeal Brief-Patents 
Commissioner for Patents 
P.O. Box 1450 
Alexandria, VA 22313-1450 

BRIEF ON APPEAL 

Sir: 

Further to the Notice of Appeal filed October 22, 2003, and received by the USPTO on 
October 27, 2003, herewith are three copies of Appellants' Brief on Appeal. Authorized fees 
include the $ 320.00 fee for the filing of this Brief. 

This is an appeal from the decision of the Examiner finally rejecting claims 2-4 and 8 of 
the above-identified application. 

(1) REAL PARTY IN INTEREST 
The above-identified application is assigned of record to Incyte Pharmaceuticals, Inc., 
(now Incyte Corporation, formerly known as Incyte Genomics, Inc.) (Reel 8620, Frame 
0308) which is the real party in interest herein. 



(2) RELATED APPEALS AND INTERFERENCES 
Appellants, their legal representative and the assignee are not aware of any related 
appeals or interferences which will directly affect or be directly affected by or have a bearing on 
the Board's decision in the instant appeal. 

(3) STATUS OF THE CLAIMS 
Claims rejected: Claims 2-4 and 8 [1] 
Claims allowed: (none) 
Claims canceled: Claims 9 and 21-24 
Claims withdrawn: Claims 1, 5-7, 10-20 [2] 

Claims on Appeal: Claims 2-4 and 8 (A copy of the claims on appeal, as amended, can 
be found in the attached Appendix). 

[1] [illegible] rejection of Claims 2-4 and 8-9 under 35 U.S.C. 112, first paragraph, as 
containing subject matter which was not described in the specification in such a way as to 
reasonably convey to one skilled in the relevant art that the inventor(s), at the time the 
application was filed, had possession of the claimed invention. Applicants note that Claim 9 
was canceled in the Response [illegible]. 

[2] [illegible] (Office Action Summary), states that "claim(s) 1, 5-7 and 10-24 is/are 
withdrawn from consideration." Applicants note that Claims 21-24 were canceled on the 
transmittal filed February [illegible]. [illegible] inadvertently listed Claims 21-24 as being 
withdrawn claims. 

(4) STATUS OF AMENDMENTS AFTER FINAL 
There were no amendments submitted after Final Rejection. 

(5) SUMMARY OF THE INVENTION 
Appellants' invention is directed, inter alia, to polynucleotides encoding polypeptides 
having homology to Alpha class glutathione s-transferases from normal human liver, pGTH2 
(GI 825605), human hepatoma, A1-1 (GI 259141), and mouse lung, GST 5.7 (GI 193710). 
These polynucleotides have a variety of utilities, in particular in expression profiling, and in 
particular for diagnosis of conditions or diseases characterized by expression of SEQ ID NO:1 
(HGST), for toxicology testing, and for drug discovery (see the Specification at, e.g., page 34, 
line 18 through page 37, line 11). As described in the Specification (page 11, line 28 through 
page 12, line 22): 

Nucleic acids encoding the human HGST of the present invention were 
first identified in Incyte Clone 1553079 from the bladder tumor cDNA library 
(BLADTUT04) through a computer-generated search for amino acid sequence 
alignments. A consensus sequence, SEQ ID NO:2, was derived from the 
following overlapping and/or extended nucleic acid sequences: Incyte Clones 
1553079/ BLADTUT04, 1328546/ PANCNOT07, 1422059/ KIDNNOT09 and 
2188683/ PROSNOT26. 

In one embodiment, the invention encompasses the novel human 
glutathione s-transferase, a polypeptide comprising the amino acid sequence of 
SEQ ID NO:1, as shown in Figures 1A, 1B, and 1C. HGST is 222 amino acids in 
length and has chemical and structural homology with two human Alpha GSTs, 
pGTH2 (GI 825605; SEQ ID NO:3), and A1-1 (GI 259141), and a mouse Alpha 
GSH, GST 5.7 (GI 193710). In particular, HGST shares 57% overall identity 
with each of the two human GSTs and 59% identity with the mouse GST. In 
addition, various amino acid residues found to be essential for the catalytic 
activity and substrate binding of GSTs are conserved in HGST and in the other 
three GST molecules. These residues are: Y9, R13, R20, E32, Q67, T68, R69, 
E97, D101, E104, and R131. Only residues E97 and E104 are not found in the 
mouse GST. E32 and E97 form salt bridges with R20 and R69, respectively, and 
these salt bridges or residues are thought to be important in structural stability of 
the GST molecule and may be important for catalysis. Y9 is essential for 
catalysis by facilitating ionization of GSH. Residues Q67, T68, D101, E104, and 
R131 are important for the binding of GSH. As illustrated by Figures 3, 4, 5, and 
6, HGST and the three Alpha GSTs have rather similar hydrophobicity plots. 
Figures 7, 8, and 9 show the isoelectric point analyses for HGST, pGHT2, and 
A1-1. The pI values of 8.8, 9.0, and 9.3, respectively, fall within the range 
characteristic of Alpha GSTs. In addition to bladder tumor, partial transcripts of 
the cDNA encoding HGST are found in fetal tissues (kidney and pancreas) and in 
prostate tissue adjacent to prostate cancer. 



(6) ISSUES 

1. Whether the polynucleotides of Claims 2-4 and 8 meet the written description 
requirement of 35 U.S.C. § 112, first paragraph. 

2. Whether the polynucleotides of Claim 8 are covered by the judicially created 
doctrine of obviousness-type double patenting over Claims 3-4 of U.S. Patent No. 5,817,497. 



(7) GROUPING OF THE CLAIMS 

As to Issue 1 

This issue pertains to Claims 2-4 and 8. 
As to Issue 2 

This issue pertains to Claim 8. 



(8) APPELLANTS' ARGUMENTS 
Issue One: Written Description Rejection 

Claims 2-4 and 8 have been rejected under the first paragraph of 35 U.S.C. § 112 for 
alleged lack of an adequate written description. The Examiner alleges that the polynucleotides 
of Claim 2 encoding polypeptide variants, the polynucleotide variants of Claim 8, and the 
polynucleotides of Claim 8 complementary to or ribonucleotide equivalents of SEQ ID NO:2 and 
SEQ ID NO:2 variants are not adequately described. 

The Examiner alleges that "there is no disclosure of naturally occurring variants having at 
least 90% sequence identity to the sequence of SEQ ID NO:1 or 2 over the entire length of SEQ 
ID NO:1 or 2 or complementary sequences within the family of enzymes." (Final Office Action, 
pages 7-8.) 

The Examiner ignores the claim limitations of "having at least 90% sequence identity to 
the sequence of SEQ ID NO:1 over the entire length of SEQ ID NO:1" and "having at least 90% 
sequence identity to the sequence of SEQ ID NO:2, over the entire length of SEQ ID NO:2." 
The Examiner ignores the limitation that the claimed polynucleotides encode a 
naturally-occurring amino acid sequence, comprise a naturally-occurring polynucleotide sequence, are 
complementary to a naturally-occurring polynucleotide sequence, or are ribonucleotide 
equivalents thereof. 

The requirements necessary to fulfill the written description requirement of 35 U.S.C. 
§ 112, first paragraph, are well established by case law. 

... the applicant must also convey with reasonable clarity to those skilled 
in the art that, as of the filing date sought, he or she was in possession of the 
invention. The invention is, for purposes of the "written description" inquiry, 
whatever is now claimed. Vas-Cath, Inc. v. Mahurkar, 19 USPQ2d 1111, 1117 
(Fed. Cir. 1991) 

Attention is also drawn to the Patent and Trademark Office's own "Guidelines for 
Examination of Patent Applications Under the 35 U.S.C. Sec. 112, para. 1", published January 5, 
2001, which provide that: 

An applicant may also show that an invention is complete by disclosure of 
sufficiently detailed, relevant identifying characteristics which provide evidence 
that applicant was in possession of the claimed invention, i.e., complete or partial 
structure, other physical and/or chemical properties, functional characteristics 
when coupled with a known or disclosed correlation between function and 
structure, or some combination of such characteristics. What is conventional or 
well known to one of ordinary skill in the art need not be disclosed in detail. If a 
skilled artisan would have understood the inventor to be in possession of the 
claimed invention at the time of filing, even if every nuance of the claims is not 
explicitly described in the specification, then the adequate description requirement 
is met. (citations omitted.) 

Thus, the written description standard is fulfilled by both what is specifically disclosed 
and what is conventional or well known to one skilled in the art. 

SEQ ID NO:1 and SEQ ID NO:2 are specifically disclosed in the application (see, for 
example, pages 50-52). The Specification states that HGST can be "obtained from any species, 
particularly mammalian, including bovine, ovine, porcine, murine, equine, and preferably 
human, from any source whether natural, synthetic, semi-synthetic, or recombinant" (e.g., 
Specification, at page 5, lines 26-29, emphasis added). Variants of SEQ ID NO:1 are described, 
for example, at page 6, lines 6-14. In particular, the preferred, more preferred, and most 
preferred SEQ ID NO:1 variants ("one having at least 80%, and more preferably 90%, amino 
acid sequence similarity to the HGST amino acid sequence (SEQ ID NO:1) . . . one having at 
least 95% amino acid sequence similarity to SEQ ID NO:1") are described, for example, at page 
12, lines 23-26. Variants of SEQ ID NO:2 are described, for example, at page 12, line 27 
through page 13, line 19. Incyte clones in which the nucleic acids encoding the human HGST 
were first identified and libraries from which those clones were isolated are described, for 
example, at page 11, line 28 through page 12, line 3, and page 39, lines 14-26 of the 
Specification. Chemical and structural features of HGST are described, for example, on page 12, 
lines 4-20. 

Complementary polynucleotide sequences are described, for example, at page 8, lines 5-13, 
page 9, lines 11-20, page 13, lines 26-28, page 28, lines 1-9, and page 46, lines 4-16. 
Ribonucleotide equivalents are described, e.g., at page 5, lines 11-14, and page 34, lines 18-20. 
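
By way of illustration only, the relationship between a polynucleotide sequence, its complete complement, and its ribonucleotide equivalent can be sketched in a few lines of Python. The sketch assumes that "completely complementary" is read as the reverse complement of the full sequence and that a "ribonucleotide equivalent" replaces thymine with uracil; the function names and the short test sequence are hypothetical and are not taken from the Specification or from SEQ ID NO:2.

```python
# Illustrative sketch only; names and the example sequence are hypothetical.

# Translation table pairing each DNA base with its Watson-Crick partner.
_DNA_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna: str) -> str:
    """Return the strand complementary to `dna` over its full length,
    written 5' to 3' (assumed reading of "completely complementary")."""
    return dna.translate(_DNA_COMPLEMENT)[::-1]


def ribonucleotide_equivalent(dna: str) -> str:
    """Return the RNA form of `dna`, with thymine replaced by uracil."""
    return dna.replace("T", "U").replace("t", "u")


if __name__ == "__main__":
    example = "ATGGCC"  # hypothetical sequence, not SEQ ID NO:2
    print(reverse_complement(example))
    print(ribonucleotide_equivalent(example))
```

Both operations are fully determined by the starting sequence, so no information beyond that sequence is needed to write out either derived form.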

Given SEQ ID NO:1, one of ordinary skill in the art would recognize a polynucleotide 
encoding a naturally-occurring variant of SEQ ID NO:1 having at least 90% sequence identity to 
SEQ ID NO:1. Given SEQ ID NO:2, one of ordinary skill in the art would recognize a 
naturally-occurring variant of SEQ ID NO:2 having at least 90% sequence identity to SEQ ID NO:2. The 
Specification describes how to use BLAST to determine whether a given sequence falls within 
the "having at least 90% sequence identity" scope. (Specification, page 41, line 28 through page 
42, line 13.) 
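
As a minimal sketch of the arithmetic behind the "at least 90% sequence identity ... over the entire length" limitation, the Python below computes percent identity from two sequences that have already been aligned. It is not BLAST and does not reproduce the procedure in the Specification; the alignment is assumed to come from some external tool, the gap character `-` is an assumption, and the function names and toy sequences are hypothetical.

```python
# Illustrative sketch only; not the BLAST procedure described in the Specification.
# Inputs are two rows of a precomputed pairwise alignment (equal length, "-" for gaps).


def percent_identity_over_reference(aligned_ref: str, aligned_query: str,
                                    gap: str = "-") -> float:
    """Percent of reference positions matched identically by the query,
    with the full ungapped reference length as the denominator."""
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")
    ref_length = sum(1 for r in aligned_ref if r != gap)
    if ref_length == 0:
        raise ValueError("reference sequence is empty")
    matches = sum(
        1
        for r, q in zip(aligned_ref, aligned_query)
        if r != gap and r.upper() == q.upper()
    )
    return 100.0 * matches / ref_length


def meets_identity_threshold(aligned_ref: str, aligned_query: str,
                             threshold: float = 90.0) -> bool:
    """True when the query reaches `threshold` percent identity over the
    entire length of the reference."""
    return percent_identity_over_reference(aligned_ref, aligned_query) >= threshold


if __name__ == "__main__":
    # Hypothetical 10-residue sequences differing at one position.
    print(percent_identity_over_reference("ACDEFGHIKL", "ACDEFGHIKV"))
```

Because the denominator is the full reference length, a query that matches only part of the reference is penalized for the unmatched remainder, which is the effect of the "over the entire length" language.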




The Examiner asserts that "there is no assay for determining naturally occurring 
sequences which support allowing one of skill to screen for such naturally occurring variants 
having at least 90% sequence identity to the sequence of SEQ ID NO: 1 or 2 over the entire 
length of SEQ ID NO: 1 or 2." (Final Office Action, page 7.) However, sequence information is 
not provided in a vacuum. Identification of the source of the sequence will typically allow one to 
determine if it is naturally-occurring. Also, attempted deceit to hide the source will not preclude 
infringement. 

The Examiner asserts that "Applicants have not taught" "any variant or mutant that has 
similar sequence identity yet has a different function" [than SEQ ID NO:1 or SEQ ID NO:2]. 
(Final Office Action, page 6.) However, the Specification teaches that: 

Also included within the scope of the present invention are alleles of the 
genes encoding HGST. As used herein, an "allele" or "allelic sequence" is an 
alternative form of the gene which may result from at least one mutation in the 
nucleic acid sequence. Alleles may result in altered mRNAs or polypeptides 
whose structure or function may or may not be altered. (Specification, page 
14, lines 14-18, emphasis added.) 

There simply is no requirement that the claims recite particular variant polypeptide, 
variant polynucleotide, complementary polynucleotide, or ribonucleotide equivalent 
polynucleotide sequences because the claims already provide sufficient structural definition of 
the claimed subject matter. That is, the polynucleotides encoding polypeptide variants are 
defined in terms of SEQ ID NO:1 ("An isolated polynucleotide encoding a polypeptide 
comprising an amino acid sequence ... of a naturally-occurring amino acid sequence having at 
least 90% sequence identity to the sequence of SEQ ID NO:1 over the entire length of SEQ ID 
NO:1."). The polynucleotide variants, complementary polynucleotides, and ribonucleotide 
equivalent polynucleotides are defined in terms of SEQ ID NO:2 ("An isolated polynucleotide 
comprising a sequence selected from the group consisting of: a) a polynucleotide sequence of 
SEQ ID NO:2, b) a naturally-occurring polynucleotide sequence having at least 90% sequence 
identity to the sequence of SEQ ID NO:2, over the entire length of SEQ ID NO:2, c) a 
polynucleotide sequence completely complementary to a), d) a polynucleotide sequence 
completely complementary to b) and e) a ribonucleotide equivalent of a)-d)."). 

Because the recited polypeptide variants are defined in terms of SEQ ID NO:1, and the 
recited polynucleotide variants, complementary polynucleotides, and ribonucleotide equivalent 
polynucleotides are defined in terms of SEQ ID NO:2, the precise chemical structure of every 
polypeptide variant, every polynucleotide variant, every complementary polynucleotide, and 
every ribonucleotide equivalent polynucleotide within the scope of the claims can be discerned. 
The Examiner's position is nothing more than a misguided attempt to require Applicants to 
unduly limit the scope of their claimed invention. Accordingly, the Specification provides an 
adequate written description of the recited polypeptide and polynucleotide sequences. 

I. The present claims specifically define the claimed genus through the recitation of 
chemical structure 

Court cases in which "DNA claims" have been at issue commonly emphasize that the 
recitation of structural features or chemical or physical properties are important factors to 
consider in a written description analysis of such claims. For example, in Fiers v. Revel, 25 
USPQ2d 1601, 1606 (Fed. Cir. 1993), the court stated that: 

If a conception of a DNA requires a precise definition, such as by 
structure, formula, chemical name or physical properties, as we have held, then a 
description also requires that degree of specificity. 

In a number of instances in which claims to DNA have been found invalid, the courts 
have noted that the claims attempted to define the claimed DNA in terms of functional 
characteristics without any reference to structural features. As set forth by the court in 
University of California v. Eli Lilly and Co., 43 USPQ2d 1398, 1406 (Fed. Cir. 1997): 

In claims to genetic material, however, a generic statement such as 
"vertebrate insulin cDNA" or "mammalian insulin cDNA," without more, is not 
an adequate written description of the genus because it does not distinguish the 
claimed genus from others, except by function. 

Thus, the mere recitation of functional characteristics of a DNA, without the definition of 
structural features, has been a common basis by which courts have found invalid claims to DNA. 
For example, in Lilly, 43 USPQ2d at 1407, the court found invalid for violation of the written 
description requirement the following claim of U.S. Patent No. 4,652,525: 

1. A recombinant plasmid replicable in procaryotic host containing within 
its nucleotide sequence a subsequence having the structure of the reverse 
transcript of an mRNA of a vertebrate, which mRNA encodes insulin. 



In Fiers, 25 USPQ2d at 1603, the parties were in an interference involving the following 
count: 

A DNA which consists essentially of a DNA which codes for a human 
fibroblast interferon-beta polypeptide. 

Party Revel in the Fiers case argued that its foreign priority application contained an 
adequate written description of the DNA of the count because that application mentioned a 
potential method for isolating the DNA. The Revel priority application, however, did not have a 
description of any particular DNA structure corresponding to the DNA of the count. The court 
therefore found that the Revel priority application lacked an adequate written description of the 
subject matter of the count. 

Thus, in Lilly and Fiers, nucleic acids were defined on the basis of functional 
characteristics and were found not to comply with the written description requirement of 35 
U.S.C. §112; i.e., "an mRNA of a vertebrate, which mRNA encodes insulin" in Lilly, and "DNA 
which codes for a human fibroblast interferon-beta polypeptide" in Fiers. In contrast to the 
situation in Lilly and Fiers, the claims at issue in the present application define polynucleotides 
in terms of chemical structure, rather than on functional characteristics. For example, the 
"variant language" of independent Claims 2 and 8 recites chemical structure to define the 
claimed genus: 

2. An isolated polynucleotide encoding a polypeptide comprising an amino 
acid sequence selected from the group consisting of . . . 

b) a naturally-occurring amino acid sequence having at least 90% 
sequence identity to the sequence of SEQ ID NO:1 over the entire length 
of SEQ ID NO:1. 

8. An isolated polynucleotide comprising a sequence selected from the group 
consisting of . . . 

b) a naturally-occurring polynucleotide sequence having at least 90% 
sequence identity to the sequence of SEQ ID NO:2, over the entire length 
of SEQ ID NO:2 . . . 

The Final Office Action asserts that "the nucleic acid itself is required" to describe the 
claimed polynucleotides (Final Office Action, page 8). However, Fiers does not state that the 
nucleic acid itself is required, but that the claimed sequence must be defined by more than 
merely functional properties, such as coding for human erythropoietin. ("Conception of a DNA, 
like conception of any chemical substance, requires a definition of that substance by other than 
its functional utility" Fiers v. Revel, 25 USPQ2d 1601 at 1604). Fiers makes clear that chemical 
and physical properties may be used to define a claimed sequence. ("Conception of a substance 
claimed per se without reference to a process requires conception of its structure, name, formula, 
or definitive chemical or physical properties" Fiers v. Revel, at 1605). 

From the above it should be apparent that the claims of the subject application are 
fundamentally different from those found invalid in Lilly and Fiers. The subject matter of the 
present claims is defined in terms of the chemical structure of SEQ ID NO:1 and SEQ ID NO:2. 
In the present case, there is no reliance merely on a description of functional characteristics of 
the polynucleotides recited by the claims. In fact, there is no recitation of functional 
characteristics. Moreover, if such functional recitations were included, it would add to the 
structural characterization of the recited polynucleotides. The polynucleotides defined in the 
claims of the present application recite structural features, and cases such as Lilly and Fiers stress 
that the recitation of structure is an important factor to consider in a written description analysis 
of claims of this type. By failing to base its written description inquiry "on whatever is now 
claimed," the Final Office Action failed to provide an appropriate analysis of the present claims 
and how they differ from those found not to satisfy the written description requirement in Lilly 
and Fiers. 



II. The present claims do not define a genus which is highly diverse 

Furthermore, the claims at issue do not describe a genus which could be characterized as 
highly diverse. Available evidence illustrates that the claimed genus is of narrow scope. 

In support of this assertion, the Examiner's attention is directed to the enclosed reference 
by Brenner et al. ("Assessing sequence comparison methods with reliable structurally identified 
distant evolutionary relationships," Proc. Natl. Acad. Sci. USA (1998) 95:6073-6078; Reference 
No. 1.) Through exhaustive analysis of a data set of proteins with known structural and 
functional relationships and with <90% overall sequence identity, Brenner et al. have determined 
that 30% identity is a reliable threshold for establishing evolutionary homology between two 
sequences aligned over at least 150 residues. (Brenner et al., pages 6073 and 6076.) 
Furthermore, local identity is particularly important in this case for assessing the significance of 
the alignments, as Brenner et al. further report that >40% identity over at least 70 residues is 
reliable in signifying homology between proteins. (Brenner et al., page 6076.) 
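
The two thresholds attributed to Brenner et al. in the preceding paragraph can be written as a simple decision rule. The Python sketch below encodes only the figures as stated in this brief (30% identity over at least 150 aligned residues, or more than 40% identity over at least 70 residues); it is not code from the reference, and the function name is hypothetical.

```python
# Illustrative sketch only; thresholds are those stated in this brief's
# characterization of Brenner et al., not an implementation from the reference.


def identity_signals_homology(percent_identity: float, aligned_residues: int) -> bool:
    """True when an alignment meets either threshold described above."""
    # 30% identity over an alignment of at least 150 residues.
    if aligned_residues >= 150 and percent_identity >= 30.0:
        return True
    # More than 40% identity over an alignment of at least 70 residues.
    if aligned_residues >= 70 and percent_identity > 40.0:
        return True
    return False


if __name__ == "__main__":
    # 57% identity over the 222 residues of HGST, the figures the
    # Specification reports for HGST against the two human Alpha GSTs.
    print(identity_signals_homology(57.0, 222))
```

On this rule, any sequence meeting the claimed 90% identity over the full 222-residue length of SEQ ID NO:1 clears both thresholds by a wide margin.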

The present application is directed, inter alia, to glutathione s-transferase proteins related 
to the amino acid sequence of SEQ ID NO:1. In accordance with Brenner et al., 
naturally-occurring molecules may exist which could be characterized as glutathione s-transferase proteins 
and which have as little as 40% identity over at least 70 residues to SEQ ID NO:1. The "variant 
language" of the present claims recites, for example, polynucleotides encoding a polypeptide 
comprising "a naturally-occurring amino acid sequence having at least 90% sequence identity to 
the sequence of SEQ ID NO:1 over the entire length of SEQ ID NO:1." This variation is far less 
than that of all potential glutathione s-transferase proteins related to SEQ ID NO:1, i.e., those 
glutathione s-transferase proteins having as little as 40% identity over at least 70 residues to SEQ 
ID NO:1. 



And, in any case, the "function" of the claimed polynucleotides and polypeptides 
encoded by the claimed polynucleotides is immaterial to their use in toxicology testing, as 
described in the Specification and as well known in the art (see the Specification at, e.g., page 
34, line 18 through page 37, line 11; Bedilion Declaration filed March 24, 2003, e.g., ¶ 15). 

III. The state of the art at the time of the present invention is further advanced than at 
the time of the Lilly and Fiers applications 

In the Lilly case, claims of U.S. Patent No. 4,652,525 were found invalid for failing to 
comply with the written description requirement of 35 U.S.C. §112. The '525 patent claimed the 
benefit of priority of two applications. Application Serial No. 801,343 filed May 27, 1977 and 
Application Serial No. 805,023 filed June 9, 1977. In the Fiers case, party Revel claimed the 
benefit of priority of an Israeli application filed on November 21, 1979. Thus, the written 
description inquiry in those cases was based on the state of the art at essentially the "dark 
ages" of recombinant DNA technology. 

The present application has a priority date of November 26, 1996. Much has happened in 
the development of recombinant DNA technology in the 17 or more years between the time of filing 
of the applications involved in Lilly and Fiers and that of the present application. For example, the 
technique of polymerase chain reaction (PCR) was invented. Highly efficient cloning and DNA 
sequencing technology has been developed. Large databases of protein and nucleotide 
sequences have been compiled. Much of the raw material of the human and other genomes has 
been sequenced. With these remarkable advances one of skill in the art would recognize that, 
given the sequence information of SEQ ID NO:1 and SEQ ID NO:2, and the additional extensive 
detail provided by the subject application, the present inventors were in possession of the recited 
polypeptide variants, polynucleotide variants, complementary polynucleotides, and 
ribonucleotide equivalent polynucleotides at the time of filing of this application. 


IV. Summary 

The Final Office Action failed to base its written description inquiry "on whatever is now 
claimed." Consequently, the Final Office Action did not provide an appropriate analysis of the 
present claims and how they differ from those found not to satisfy the written description 
requirement in cases such as Lilly and Fiers. In particular, the claims of the subject application 
are fundamentally different from those found invalid in Lilly and Fiers. The subject matter of 
the present claims is defined in terms of the chemical structure of SEQ ID NO:1 or SEQ ID 
NO:2. The courts have stressed that structural features are important factors to consider in a 
written description analysis of claims to nucleic acids and proteins. In addition, the genus of 
polynucleotides defined by the present claims is adequately described, as evidenced by Brenner 
et al. Furthermore, there have been remarkable advances in the state of the art since the Lilly and 
Fiers cases, and these advances were given no consideration whatsoever in the position set forth 
by the Final Office Action. 



Issue Two: Double Patenting Rejection 

Claim 8 was rejected under the judicially created doctrine of obviousness-type double 
patenting over Claims 3-4 of U.S. Patent No. 5,817,497. While not conceding the propriety of 
the Examiner's position, Appellants are willing to submit a Terminal Disclaimer with respect to 
U.S. Patent No. 5,817,497 in the interest of expediting prosecution of the subject application, 
upon indication that the application is otherwise allowable. Therefore, it is requested that the 
Board indicate that the subject application will be allowable upon submission of such a Terminal 
Disclaimer. 




(9) CONCLUSION 

Appellants request that the rejections of the claims on appeal be reversed for at least the 
above reasons. 

If the USPTO determines that any additional fees are due, the Commissioner is hereby 
authorized to charge Deposit Account No. 09-0108. 
This brief is enclosed in triplicate. 



Respectfully submitted, 
INCYTE CORPORATION 



Date: [illegible] 



3160 Porter Drive 
Palo Alto, California 94304 
Phone: (650) 855-0555 
Fax: (650) 849-8886 



Susan K. Sather 
Reg. No. 44,316 

Direct Dial Telephone: (650) 845-4646 

Customer No.: 27904 



Enclosures: 

1. Brenner et al. ("Assessing sequence comparison methods with reliable structurally 
identified distant evolutionary relationships," Proc. Natl. Acad. Sci. USA (1998) 95:6073-6078) 




APPENDIX - CLAIMS ON APPEAL 

2. (Previously Presented) An isolated polynucleotide encoding a polypeptide 
comprising an amino acid sequence selected from the group consisting of: 

a) an amino acid sequence of SEQ ID NO:1, and 

b) a naturally-occurring amino acid sequence having at least 90% sequence identity 
to the sequence of SEQ ID NO:1 over the entire length of SEQ ID NO:1. 

3. (Original) A recombinant polynucleotide comprising a promoter sequence 
operably linked to a polynucleotide of claim 2. 

4. (Original) A cell transformed with a recombinant polynucleotide of claim 3. 

8. (Previously Presented) An isolated polynucleotide comprising a sequence 
selected from the group consisting of: 

a) a polynucleotide sequence of SEQ ID NO:2, 

b) a naturally-occurring polynucleotide sequence having at least 90% sequence 
identity to the sequence of SEQ ID NO:2, over the entire length of SEQ ID NO:2, 

c) a polynucleotide sequence completely complementary to a), 

d) a polynucleotide sequence completely complementary to b) and 

e) a ribonucleotide equivalent of a)-d). 






Docket No.: PF-0162-3 DIV 
USSN: 09/784,739 
Ref. No. 1 of 1 



Proc. Natl. Acad. Sci. USA 

Vol. 95, pp. 6073-6078, May 1998 

Biochemistry 



Assessing sequence comparison methods with reliable structurally 
identified distant evolutionary relationships 



Steven E. Brenner*†‡, Cyrus Chothia*, and Tim J. P. Hubbard§ 

*MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, United Kingdom; and §Sanger Centre, Wellcome Trust Genome Campus, Hinxton, 
Cambs CB10 1SA, United Kingdom 

Communicated by David R. Davies, National Institute of Diabetes, Bethesda, MD, March 16, 1998 (received for review November 12, 1997) 



ABSTRACT Pairwise sequence comparison methods have 
been assessed using proteins whose relationships are known 
reliably from their structures and functions, as described in 
the SCOP database [Murzin, A. G., Brenner, S. E., Hubbard, T. 
& Chothia, C. (1995) J. Mol. Biol. 247, 536-540]. The evaluation 
tested the programs BLAST [Altschul, S. F., Gish, W., 
Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol. 
215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996) 
Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. & 
Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448], 
and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol. 
Biol. 147, 195-197] and their scoring schemes. The error rate 
of all algorithms is greatly reduced by using statistical scores 
to evaluate matches rather than percentage identity or raw 
scores. The E-value statistical scores of SSEARCH and FASTA are 
reliable: the number of false positives found in our tests agrees 
well with the scores reported. However, the P-values reported 
by BLAST and WU-BLAST2 exaggerate significance by orders of 
magnitude. SSEARCH, FASTA ktup = 1, and WU-BLAST2 perform 
best, and they are capable of detecting almost all relationships 
between proteins whose sequence identities are >30%. For 
more distantly related proteins, they do much less well; only 
one-half of the relationships between proteins with 20-30% 
identity are found. Because many homologs have low sequence 
similarity, most distant relationships cannot be detected by 
any pairwise comparison method; however, those which are 
identified may be used with confidence. 

Sequence database searching plays a role in virtually every 
branch of molecular biology and is crucial for interpreting the 
sequences issuing forth from genome projects. Given the 
method's central role, it is surprising that overall and relative 
capabilities of different procedures are largely unknown. It is 
difficult to verify algorithms on sample data because this 
requires large data sets of proteins whose evolutionary relationships 
are known unambiguously and independently of the 
methods being evaluated. However, nearly all known homologs 
have been identified by sequence analysis (the method 
to be tested). Also, it is generally very difficult to know, in the 
absence of structural data, whether two proteins that lack clear 
sequence similarity are unrelated. This has meant that although 
previous evaluations have helped improve sequence 
comparison, they have suffered from insufficient, imperfectly 
characterized, or artificial test data. Assessment also has been 
problematic because high quality database sequence searching 
attempts to have both sensitivity (detection of homologs) and 
specificity (rejection of unrelated proteins); however, these 
complementary goals are linked such that increasing one 
causes the other to be reduced. 



The publication costs of this article were defrayed in part by page charge 
payment. This article must therefore be hereby marked "advertisement*' in 
accordance with 18 U.S.C. §1734 solely to indicate this fact. 

© 1998 by The National Academy of Sciences 0027-8424/98/956073-6$2.00/0
PNAS is available online at http://www.pnas.org. 



Sequence comparison methodologies have evolved rapidly, 
so no previously published test has evaluated modern versions
of programs commonly used. For example, parameters in
BLAST (1) have changed, and WU-BLAST2 (2), which produces
gapped alignments, has become available. The latest version
of FASTA (3) previously tested was 1.6, but the current release 
(version 3.0) provides fundamentally different results in the 
form of statistical scoring. 

The previous reports also have left gaps in our knowledge.
For example, there has been no published assessment of
thresholds for scoring schemes more sophisticated than per- 
centage identity. Thus, the widely discussed statistical scoring 
measures have never actually been evaluated on large data- 
bases of real proteins. Moreover, the different scoring schemes 
commonly in use have not been compared. 

Beyond these issues, there is a more fundamental question:
in an absolute sense, how well does pairwise sequence
comparison work? That is, what fraction of homologous proteins
can be detected using modern database searching methods?

In this work, we attempt to answer these questions and to
overcome both of the fundamental difficulties that have
hindered assessment of sequence comparison methodologies.
First, we use the set of distant evolutionary relationships in the
SCOP: Structural Classification of Proteins database (4), which
is derived from structural and functional characteristics (5).
The SCOP database provides a uniquely reliable set of ho- 
mologs, which are known independently of sequence compar- 
ison. Second, we use an assessment method that jointly mea- 
sures both sensitivity and specificity. This method allows 
straightforward comparison of different sequence searching 
procedures. Further, it can be used to aid interpretation of real 
database searches and thus provide optimal and reliable 
results. 

Previous Assessments of Sequence Comparison. Several 
previous studies have examined the relative performance of 
different sequence comparison methods. The most
encompassing analyses have been by Pearson (6, 7), who compared
the three most commonly used programs. Of these, the Smith- 
Waterman algorithm (8) implemented in SSEARCH (3) is the 
oldest and slowest but the most rigorous. Modern heuristics 
have provided BLAST (1) the speed and convenience to make 
it the most popular program. Intermediate between these two 
is FASTA (3), which may be run in two modes offering either 
greater speed (ktup = 2) or greater effectiveness (ktup = 1).
Pearson also considered different parameters for each of these
programs. 

To test the methods, Pearson selected two representative 
proteins from each of 67 protein superfamilies defined by the 
PIR database (9). Each was used as a query to search the 
database, and the matched proteins were marked as being 
homologous or unrelated according to their membership of PIR

Abbreviation: EPQ, errors per query. 

†Present address: Department of Structural Biology, Stanford
University, Fairchild Building D-109, Stanford, CA 94305-5126.

*To whom reprint requests should be addressed. e-mail:
brenner@hyper.stanford.edu.






6074 Biochemistry: Brenner et al. 



Proc. Natl. Acad. Sci. USA 95 (1998)



superfamilies. Pearson found that modern matrices and
"ln-scaling" of raw scores improve results considerably. He also
reported that the rigorous Smith-Waterman algorithm worked
slightly better than FASTA, which was in turn more effective
than BLAST.

Very large scale analyses of matrices have been performed
(10), and Henikoff and Henikoff (11) also evaluated the
effectiveness of BLAST and FASTA. Their test with BLAST
considered the ability to detect homologs above a predetermined
score but had no penalty for methods which also
reported large numbers of spurious matches. The Henikoffs
searched the SWISS-PROT database (12) and used PROSITE (13)
to define homologous families. Their results showed that the
BLOSUM62 matrix (14) performed markedly better than the
extrapolated PAM-series matrices (15), which previously had
been popular.

A crucial aspect of any assessment is the data that are used
to test the ability of the program to find homologs. But in
Pearson's and the Henikoffs' evaluations of sequence
comparison, the correct results were effectively unknown. This is
because the superfamilies in PIR and PROSITE are principally
created by using the same sequence comparison methods
which are being evaluated. Interdependency of data and
methods creates a "chicken and egg" problem, and means, for
example, that new methods would be penalized for correctly
identifying homologs missed by older programs. For instance,
immunoglobulin variable and constant domains are clearly
homologous, but PIR places them in different superfamilies.
The problem is widespread: each superfamily in PIR 48.00 with
a structural homolog is itself homologous to an average of 1.6
other PIR superfamilies (16).

To surmount these sorts of difficulties, Sander and Schneider
(17) used protein structures to evaluate sequence comparison.
Rather than comparing different sequence comparison
algorithms, their work focused on determining a length-dependent
threshold of percentage identity, above which all
proteins would be of similar structure. A result of this analysis
was the HSSP equation; it states that proteins with 25% identity
over 80 residues will have similar structures, whereas shorter
alignments require higher identity. (Other studies also have
used structures (18-20), but these focused on a small number
of model proteins and were principally oriented toward
evaluating alignment accuracy rather than homology detection.)

A general solution to the problem of scoring comes from
statistical measures (i.e., E-values and P-values) based on the
extreme value distribution (21). Extreme value scoring was
implemented analytically in the BLAST program using the
Karlin and Altschul statistics (22, 23), and empirical
approaches have been recently added to FASTA and SSEARCH. In
addition to being heralded as a reliable means of recognizing
significantly similar proteins (24, 25), the mathematical
tractability of statistical scores "is a crucial feature of the BLAST
algorithm" (1). The validity of this scoring procedure has been
tested analytically and empirically (see ref. 2 and references in
ref. 24). However, all large empirical tests used random
sequences that may lack the subtle structure found within
biological sequences (26, 27) and obviously do not contain any
real homologs. Thus, although many researchers have
suggested that statistical scores be used to rank matches (24, 25,
28), there have been no large rigorous experiments on biological
data to determine the degree to which such rankings are
superior.

A Database for Testing Homology Detection. Since the
discovery that the structures of hemoglobin and myoglobin are
very similar though their sequences are not (29), it has been
apparent that comparing structures is a more powerful (if less
convenient) way to recognize distant evolutionary relationships
than comparing sequences. If two proteins show a high
degree of similarity in their structural details and function, it
is very probable that they have an evolutionary relationship
though their sequence similarity may be low.

The recent growth of protein structure information combined
with the comprehensive evolutionary classification in
the SCOP database (4, 5) has allowed us to overcome previous
limitations. With these data, we can evaluate the performance
of sequence comparison methods on real protein sequences
whose relationships are known confidently. The SCOP database
uses structural information to recognize distant homologs, the
large majority of which can be determined unambiguously.
These superfamilies, such as the globins or the immunoglobulins,
would be recognized as related by the vast majority of the
biological community despite the lack of high sequence
similarity.

From SCOP, we extracted the sequences of domains of
proteins in the Protein Data Bank (PDB) (30) and created two
databases. One (PDB90D-B) has domains which were all <90%
identical to any other, whereas the other (PDB40D-B) had those <40%
identical. The databases were created by first sorting all
protein domains in SCOP by their quality and making a list. The
highest quality domain was selected for inclusion in the
database and removed from the list. Also removed from the list
(and discarded) were all other domains above the threshold
level of identity to the selected domain. This process was
repeated until the list was empty. The PDB40D-B database
contains 1,323 domains, which have 9,044 ordered pairs of
distant relationships, or ≈0.5% of the total 1,749,006 ordered
pairs. In PDB90D-B, the 2,079 domains have 53,988
relationships, representing 1.2% of all pairs. Low complexity regions
of sequence can achieve spurious high scores, so these were
masked in both databases by processing with the SEG program
(27) using recommended parameters: 12 1.8 2.0. The databases
used in this paper are available from http://sss.stanford.edu/sss/,
and databases derived from the current version of SCOP
may be found at http://scop.mrc-lmb.cam.ac.uk/scop/.
Analyses from both databases were generally consistent, but
PDB40D-B focuses on distantly related proteins and reduces the
heavy overrepresentation in the PDB of a small number of
families (31, 32), whereas PDB90D-B (with more sequences)
improves evaluations of statistics. Except where noted otherwise,
the distant homolog results here are from PDB40D-B.
Although the precise numbers reported here are specific to the
structural domain databases used, we expect the trends to be
general.
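The selection procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' software: the function names, the numeric quality scale, and the `identity` callback are all hypothetical, and the paper does not say how quality or pairwise identity were computed. The second helper reproduces the ordered-pair counts quoted in the text.

```python
from typing import Callable, List, Sequence, Tuple


def select_nonredundant(
    domains: Sequence[Tuple[str, float]],
    identity: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Greedy filter: keep the best remaining domain, drop its near-duplicates, repeat.

    domains   -- (name, quality) pairs; higher quality is better (hypothetical scale)
    identity  -- returns percent identity between two named domains (assumed given)
    threshold -- domains at or above this identity to a kept domain are discarded
    """
    # Sort once by quality, best first, to form the working list.
    remaining = [name for name, _ in sorted(domains, key=lambda d: d[1], reverse=True)]
    kept: List[str] = []
    while remaining:
        best = remaining.pop(0)          # highest-quality domain still on the list
        kept.append(best)
        # Discard everything too similar to the domain just selected.
        remaining = [d for d in remaining if identity(best, d) < threshold]
    return kept


def ordered_pairs(n: int) -> int:
    """Number of ordered (query, target) pairs among n sequences, excluding self-pairs."""
    return n * (n - 1)
```

With 1,323 domains, `ordered_pairs` gives 1,749,006, so the 9,044 homologous pairs in PDB40D-B are about 0.5% of all pairs; for the 2,079 domains of PDB90D-B the 53,988 relationships are about 1.2%.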

Assessment Data and Procedure. Our assessment of
sequence comparison may be divided into four different major
categories of tests. First, using just a single sequence
comparison algorithm at a time, we evaluated the effectiveness of
different scoring schemes. Second, we assessed the reliability
of scoring procedures, including an evaluation of the validity
of statistical scoring. Third, we compared sequence comparison
algorithms (using the optimal scoring scheme) to determine
their relative performance. Fourth, we examined the
distribution of homologs and considered the power of pairwise
sequence comparison to recognize them. All of the analyses
used the databases of structurally identified homologs and a
new assessment criterion.

The analyses tested BLAST (1), version 1.4.9MP, and
WU-BLAST2 (2), version 2.0a13MP. Also assessed was the FASTA
package, version 3.0t76 (3), which provided FASTA and the
SSEARCH implementation of Smith-Waterman (8). For
SSEARCH and FASTA, we used BLOSUM45 with gap penalties
-12/-1 (7, 16). The default parameters and matrix
(BLOSUM62) were used for BLAST and WU-BLAST2.

The "Coverage Vs. Error" Plot. To test a particular protocol 
(comprising a program and scoring scheme), each sequence 
from the database was used as a query to search the database. 
This yielded ordered pairs of query and target sequences with 
associated scores, which were sorted, on the basis of their 
scores, from best to worst. The ideal method would have 



[Figure 1 plots: (A) Smith-Waterman Scoring Schemes (PDB40D-B); (B) Smith-Waterman Scoring Schemes (PDB90D-B). x axis: Coverage.]


Fig. 1. Coverage vs. error plots of different scoring schemes for SSEARCH Smith-Waterman. (A) Analysis of PDB40D-B database. (B) Analysis
of PDB90D-B database. All of the proteins in the database were compared with each other using the SSEARCH program. The results of this single
set of comparisons were considered using five different scoring schemes and assessed. The graphs show the coverage and errors per query (EPQ)
for statistical scores, raw scores, and three measures using percentage identity. In the coverage vs. error plot, the x axis indicates the fraction of
all homologs in the database (known from structure) which have been detected. Precisely, it is the number of detected pairs of proteins with the
same fold divided by the total number of pairs from a common superfamily. PDB40D-B contains a total of 9,044 homologs, so a score of 10% indicates
identification of 904 relationships. The y axis reports the number of EPQ. Because there are 1,323 queries made in the PDB40D-B all-vs.-all
comparison, 13 errors correspond to 0.01, or 1% EPQ. The y axis is presented on a log scale to show results over the widely varying degrees of
accuracy which may be desired. The scores that correspond to the levels of EPQ and coverage are shown in Fig. 4 and Table 1. The graph
demonstrates the trade-off between sensitivity and selectivity. As more homologs are found (moving to the right), more errors are made (moving
up). The ideal method would be in the lower right corner of the graph, which corresponds to identifying many evolutionary relationships without
selecting unrelated proteins. Three measures of percentage identity are plotted. Percentage identity within alignment is the degree of identity within
the aligned region of the proteins, without consideration of the alignment length. Percentage identity within both is the number of identical residues
in the aligned region as a percentage of the average length of the query and target proteins. The HSSP equation (17) is $H = 290.15\,l^{-0.562}$, where
$l$ is length, for $10 < l < 80$; $H > 100$ for $l < 10$; $H = 24.7$ for $l > 80$. The percentage identity HSSP-adjusted score is the percent identity within
the alignment minus $H$. Smith-Waterman raw scores and E-values were taken directly from the sequence comparison program.



perfect separation, with all of the homologs at the top of the
list and unrelated proteins below. In practice, perfect
separation is impossible to achieve, so instead one is interested in
drawing a threshold above which there are the largest number 
of related pairs of sequences consistent with an acceptable 
error rate. 

Our procedure involved measuring the coverage and error
for every threshold. Coverage was defined as the fraction of
structurally determined homologs that have scores above the
selected threshold; this reflects the sensitivity of a method.
Errors per query (EPQ), an indicator of selectivity, is the
number of nonhomologous pairs above the threshold divided
by the number of queries. Graphs of these data, called
coverage vs. error plots, were devised to understand how
protocols compare at different levels of accuracy. These
graphs share effectively all of the beneficial features of
Receiver Operating Characteristic (ROC) plots (33, 34) but
better represent the high degrees of accuracy required in
sequence comparison and the huge background of
nonhomologs.
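The definitions of coverage and EPQ translate directly into a short computation over a score-sorted hit list. The following is a sketch under stated assumptions: the function names and the `(score, is_homolog)` input layout are hypothetical, higher scores are taken to be better (E-values would need to be negated first), and tied scores are not given any special treatment.

```python
from typing import List, Tuple


def coverage_vs_error(
    hits: List[Tuple[float, bool]],
    total_homologs: int,
    num_queries: int,
) -> List[Tuple[float, float, float]]:
    """Return (threshold score, coverage, EPQ) after each hit, best score first.

    hits           -- (score, is_homolog) for every reported query-target pair
    total_homologs -- number of structurally known homologous ordered pairs
    num_queries    -- number of query sequences searched
    """
    curve = []
    true_pos = 0    # homologous pairs at or above the current threshold
    false_pos = 0   # nonhomologous pairs at or above the current threshold
    for score, is_homolog in sorted(hits, key=lambda h: h[0], reverse=True):
        if is_homolog:
            true_pos += 1
        else:
            false_pos += 1
        curve.append((score, true_pos / total_homologs, false_pos / num_queries))
    return curve


def coverage_at_epq(curve: List[Tuple[float, float, float]], max_epq: float) -> float:
    """Largest coverage reached while EPQ stays at or below max_epq."""
    acceptable = [cov for _, cov, epq in curve if epq <= max_epq]
    return max(acceptable) if acceptable else 0.0
```

For the PDB40D-B all-vs.-all comparison, `num_queries` would be 1,323, so 13 nonhomologous pairs above the threshold correspond to roughly 1% EPQ.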

This assessment procedure is directly relevant to practical 
sequence database searching, for it provides precisely the 
information necessary to perform a reliable sequence database 
search. The EPQ measure places a premium on score consis- 
tency; that is, it requires scores to be comparable for different 
queries. Consistency is an aspect which has been largely 

[Figure 3 plot: Percent Identity of Unrelated Proteins (PDB90D-B). Each point plots the length and percent identity of an alignment between two unrelated proteins.]




Hemoglobin β-chain (1hdsb)    Cellulase E2 (1tml_)

1hdsb GKVDVDWGAQALGR- - LLVVYPVn^RPFQHTONI^SAGAVimNPKVKAHGKRVLDAFTQGLKH
1tml_ GOVDALMSAAQAAGKIPILWYNAPGR- - -DCGNHSSGGA PSHSAY-RSWIDEFAAGLKM

Fig. 2. Unrelated proteins with high percentage identity. Hemoglobin
β-chain (PDB code 1hds chain b, ref. 38, Left) and cellulase E2
(PDB code 1tml, ref. 39, Right) have 39% identity over 64 residues, a
level which is often believed to be indicative of homology. Despite this
high degree of identity, their structures strongly suggest that these
proteins are not related. Appropriately, neither the raw alignment
score of 85 nor the E-value of 1.3 is significant. Proteins rendered by
RASMOL (40).




[Figure 3 x axis: Alignment length.]



Fig. 3. Length and percentage identity of alignments of unrelated
proteins in PDB90D-B. Each pair of nonhomologous proteins found with
SSEARCH is plotted as a point whose position indicates the length and 
the percentage identity within the alignment. Because alignment 
length and percentage identity are quantized, many pairs of proteins 
may have exactly the same alignment length and percentage identity. 
The line shows the HSSP threshold (though it is intended to be applied 
with a different matrix and parameters). 






[Figure 4 plot: Reliability of Statistical Scores (PDB90D-B). x axis: Errors Per Query.]

Fig. 4. Reliability of statistical scores in PDB90D-B. Each line shows
the relationship between reported statistical score and actual error
rate for a different program. E-values are reported for SSEARCH and
FASTA, whereas P-values are shown for BLAST and WU-BLAST2. If the
scoring were perfect, then the number of errors per query and the
E-values would be the same, as indicated by the upper bold line.
(P-values should be the same as EPQ for small numbers, and diverge
at higher values, as indicated by the lower bold line.) E-values from
SSEARCH and FASTA are shown to have good agreement with EPQ but
underestimate the significance slightly. BLAST and WU-BLAST2 are
overconfident, with the degree of exaggeration dependent upon the
score. The results for PDB40D-B were similar to those for PDB90D-B
despite the difference in number of homologs detected. This graph
could be used to roughly calibrate the reliability of a given statistical
score.

ignored in previous tests but is essential for the straightforward
or automatic interpretation of sequence comparison results.
Further, it provides a clear indication of the confidence that
should be ascribed to each match. Indeed, the EPQ measure
should approximate the expectation value reported by database
searching programs, if the programs' estimates are
accurate.

The Performance of Scoring Schemes. All of the programs 
tested could provide three fundamental types of scores. The 
first score is the percentage identity, which may be computed 
in several ways based on either the length of the alignment or 
the lengths of the sequences. The second is a "raw" or 
"Smith-Waterman" score, which is the measure optimized by 
the Smith-Waterman algorithm and is computed by summing 
the substitution matrix scores for each position in the align- 
ment and subtracting gap penalties. In BLAST, a measure 

[Figure 5A plot: Sequence Comparison Algorithms (PDB40D-B).]



related to this score is scaled into bits. Third is a statistical 
score based on the extreme value distribution. These results 
are summarized in Fig. 1. 

Sequence Identity. Though it has been long established that
percentage identity is a poor measure (35), there is a common
rule-of-thumb stating that 30% identity signifies homology.
Moreover, publications have indicated that 25% identity can
be used as a threshold (17, 36). We find that these thresholds,
originally derived years ago, are not supported by present
results. As databases have grown, so have the possibilities for
chance alignments with high identity; thus, the reported cutoffs
lead to frequent errors. Fig. 2 shows one of the many pairs of
proteins with very different structures that nonetheless have
high levels of identity over considerable aligned regions.
Despite the high identity, the raw and the statistical scores for
such incorrect matches are typically not significant. The
principal reasons percentage identity does so poorly seem to be
that it ignores information about gaps and about the
conservative or radical nature of residue substitutions.

From the PDB90D-B analysis in Fig. 3, we learn that 30%
identity is a reliable threshold for this database only for
sequence alignments of at least 150 residues. Because one
unrelated pair of proteins has 43.5% identity over 62 residues,
it is probably necessary for alignments to be at least 70 residues
in length before 40% is a reasonable threshold, for a database
of this particular size and composition.

At a given reliability, scores based on percentage identity
detect just a fraction of the distant homologs found by
statistical scoring. If one measures the percentage identity in
the aligned regions without consideration of alignment length,
then a negligible number of distant homologs are detected.
Use of the HSSP equation improves the value of percentage
identity, but even this measure can find only 4% of all known
homologs at 1% EPQ. In short, percentage identity discards
most of the information measured in a sequence comparison.
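The three percentage identity measures compared here, as defined in the Fig. 1 legend, can be written out as follows. This is an illustrative sketch with hypothetical function names. Two details are assumptions of the sketch rather than statements from the paper: the formula is applied at the boundary lengths 10 and 80, which the legend leaves unspecified, and infinity stands in for the legend's "H > 100" for very short alignments.

```python
def hssp_threshold(length: int) -> float:
    """HSSP percent-identity threshold H for an alignment of the given length."""
    if length > 80:
        return 24.7
    if length >= 10:
        # H = 290.15 * l^(-0.562); inclusive bounds are an assumption of this sketch.
        return 290.15 * length ** -0.562
    # The legend only says H > 100 here, i.e. no identity level can pass.
    return float("inf")


def identity_within_alignment(identical: int, alignment_length: int) -> float:
    """Percent identity over the aligned region only."""
    return 100.0 * identical / alignment_length


def identity_within_both(identical: int, query_length: int, target_length: int) -> float:
    """Identical residues as a percentage of the average of the two sequence lengths."""
    return 100.0 * identical / ((query_length + target_length) / 2.0)


def hssp_adjusted(identical: int, alignment_length: int) -> float:
    """Percent identity within the alignment minus the HSSP threshold H."""
    return identity_within_alignment(identical, alignment_length) - hssp_threshold(alignment_length)
```

For an alignment like that of Fig. 2 (about 39% identity over 64 residues, i.e. roughly 25 identical positions), `hssp_threshold(64)` is about 28, so the HSSP-adjusted score is positive even though the proteins are unrelated.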

Raw Scores. Smith-Waterman raw scores perform better
than percentage identity (Fig. 1), but ln-scaling (7) provided no
notable benefit in our analysis. It is necessary to be very precise
when using either raw or bit scores because a 20% change in
cutoff score could yield a tenfold difference in EPQ. However,
it is difficult to choose appropriate thresholds because the
reliability of a bit score depends on the lengths of the proteins
matched and the size of the database. Raw score thresholds
also are affected by matrix and gap parameters.
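To make the dependence on matrix and gap parameters concrete, the raw score of a given alignment can be computed as sketched below. The function name and input layout are hypothetical. The gap convention used here (a gap of k residues costs `gap_open + (k - 1) * gap_extend`) is an assumption chosen for illustration and may not match how any particular program interprets a setting such as -12/-1; the matrix in the usage note is a toy example, not BLOSUM45.

```python
from typing import Dict, Tuple


def raw_score(
    aligned_query: str,
    aligned_target: str,
    matrix: Dict[Tuple[str, str], int],
    gap_open: int,
    gap_extend: int,
) -> int:
    """Sum substitution scores over aligned columns and subtract gap penalties.

    Both strings must have equal length, with '-' marking gap positions.
    gap_open is charged for the first residue of a gap, gap_extend for each
    further residue of the same gap (assumed convention for this sketch).
    """
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_side = None  # which sequence currently contains a gap, if any
    for q, t in zip(aligned_query, aligned_target):
        if q == "-" and t == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        side = "query" if q == "-" else "target" if t == "-" else None
        if side is None:
            score += matrix[(q, t)]      # substitution matrix score for this column
        elif side == gap_side:
            score -= gap_extend          # continuing the same gap
        else:
            score -= gap_open            # starting a new gap
        gap_side = side
    return score
```

With a toy matrix scoring +5 for identical residues and -2 otherwise, the alignment `AC-DE` against `ACCDE` with penalties 12 and 1 scores 5 + 5 - 12 + 5 + 5 = 8; changing either the matrix or the penalties changes the score, which is why raw-score cutoffs do not transfer between parameter sets.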

Statistical Scores. Statistical scores were introduced partly 
to overcome the problems that arise from raw scores. This 
scoring scheme provides the best discrimination between 
homologous proteins and those which are unrelated. Most 

[Figure 5B plot: Sequence Comparison Algorithms (PDB90D-B). x axis: Coverage.]





Fig. 5. Coverage vs. error plots of different sequence comparison methods. Five different sequence comparison methods are evaluated, each
using statistical scores (E- or P-values). (A) PDB40D-B database. In this analysis, the best method is the slow SSEARCH, which finds 18% of relationships
at 1% EPQ. FASTA ktup = 1 and WU-BLAST2 are almost as good. (B) PDB90D-B database. The quick WU-BLAST2 program provides the best coverage
at 1% EPQ on this database, although at higher levels of error it becomes slightly worse than FASTA ktup = 1 and SSEARCH.






likely, its power can be attributed to its incorporation of more 
information than any other measure; it takes account of the 
full substitution and gap data (like raw scores) but also has 
details about the sequence lengths and composition and is 
scaled appropriately. 

We find that statistical scores are not only powerful, but also
easy to interpret. SSEARCH and FASTA show close agreement
between statistical scores and actual number of errors per
query (Fig. 4). The expectation value score gives a good,
slightly conservative estimate of the chances of the two
sequences being found at random in a given query. Thus, an
E-value of 0.01 indicates that roughly one pair of nonhomologs
of this similarity should be found in every 100 different queries.
Neither raw scores nor percentage identity can be interpreted
in this way, and these results validate the suitability of the
extreme value distribution for describing the scores from a
database search.

The P-values from BLAST also should be directly interpretable
but were found to overstate significance by more than two
orders of magnitude for 1% EPQ for this database. Nonetheless,
these results strongly suggest that the analytic theory is
fundamentally appropriate. WU-BLAST2 scores were more
reliable than those from BLAST, but also exaggerate expected
confidence by more than an order of magnitude at 1% EPQ.
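The relationship between the two kinds of statistical score can be illustrated numerically. The sketch below assumes the usual Poisson model, in which an expected count E of chance matches corresponds to a probability P = 1 - exp(-E) of seeing at least one; this model is an assumption of the example rather than something derived in the text, and the function names are hypothetical.

```python
import math


def p_value_from_e_value(e_value: float) -> float:
    """Probability of at least one chance match, given the expected number (Poisson assumption)."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


def expected_errors(e_value_cutoff: float, num_queries: int) -> float:
    """Expected number of nonhomologous pairs reported across num_queries searches."""
    return e_value_cutoff * num_queries
```

Under this model an E-value of 0.01 gives a P-value of about 0.00995, so the two are nearly equal for small values, while an E-value of 10 gives a P-value just below 1, which is the divergence noted in the Fig. 4 legend. With accurate E-values, a cutoff of 0.01 over 100 queries should yield about one error.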

Overall Detection of Homologs and Comparison of
Algorithms. The results in Fig. 5A and Table 1 show that pairwise
sequence comparison is capable of identifying only a small
fraction of the homologous pairs of sequences in PDB40D-B.
Even SSEARCH with E-values, the best protocol tested, could
find only 18% of all relationships at a 1% EPQ. BLAST, which
identifies 15%, was the worst performer, whereas FASTA
ktup = 1 is nearly as effective as SSEARCH. FASTA ktup = 2 and
WU-BLAST2 are intermediate in their ability to detect
homologs. Comparison of different algorithms indicates that
those capable of identifying more homologs are generally
slower. SSEARCH is 25 times slower than BLAST and 6.5 times
slower than FASTA ktup = 1. WU-BLAST2 is slightly faster than
FASTA ktup = 2, but the latter has more interpretable scores.

In PDB90D-B, where there are many close relationships, the
best method can identify only 38% of structurally known
homologs (Fig. 5B). The method which finds that many
relationships is WU-BLAST2. Consequently, we infer that the
differences between FASTA ktup = 1, SSEARCH, and WU-BLAST2
programs are unlikely to be significant when compared with
variation in database composition and scoring reliability.

Fig. 6 helps to explain why most distant homologs cannot be
found by sequence comparison: a great many such
relationships have no more sequence identity than would be expected
by chance. SSEARCH with E-values can recognize >90% of the
homologous pairs with 30-40% identity. In this region, there
are 30 pairs of homologous proteins that do not have
significant E-values, but 26 of these involve sequences with <50
residues. Of sequences having 25-30% identity, 75% are
identified by SSEARCH E-values. However, although the
number of homologs grows at lower levels of identity, the detection
falls off sharply: only 40% of homologs with 20-25% identity



[Figure 6 plot: Distribution and Detection of Homologs (PDB40D-B). x axis: Percentage Identity: In both.]


Fig. 6. Distribution and detection of homologs in PDB40D-B. Bars
show the distribution of homologous pairs in PDB40D-B according to their
identity (using the measure of identity in both). Filled regions indicate
the number of these pairs found by the best database searching method
(SSEARCH with E-values) at 1% EPQ. The PDB40D-B database contains
proteins with <40% identity, and as shown on this graph, most
structurally identified homologs in the database have diverged
extremely far in sequence and have <20% identity. Note that the
alignments may be inaccurate, especially at low levels of identity. Filled
regions show that SSEARCH can identify most relationships that have
25% or more identity, but its detection wanes sharply below 25%.
Consequently, the great sequence divergence of most structurally
identified evolutionary relationships effectively defeats the ability of
pairwise sequence comparison to detect them.

are detected and only 10% of those with 15-20% can be found. 
These results show that statistical scores can find related
proteins whose identity is remarkably low; however, the power
of the method is restricted by the great divergence of many 
protein sequences. 
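The per-bin detection rates quoted above (for example, 40% of homologs with 20-25% identity) amount to a simple tally of known homologous pairs by identity range. A minimal sketch follows; the function name, the `(identity, detected)` input layout, and the use of half-open bins are all choices made for this illustration and are not taken from the paper.

```python
from typing import Dict, Iterable, List, Tuple


def detection_by_identity(
    pairs: Iterable[Tuple[float, bool]],
    bin_edges: List[float],
) -> Dict[Tuple[float, float], float]:
    """Fraction of homologous pairs detected in each half-open identity bin [lo, hi).

    pairs     -- (percent identity, detected at the chosen EPQ) for each known homologous pair
    bin_edges -- ascending bin boundaries, e.g. [15, 20, 25, 30, 40]
    """
    # tallies[bin] = [total pairs in bin, pairs detected in bin]
    tallies = {(lo, hi): [0, 0] for lo, hi in zip(bin_edges, bin_edges[1:])}
    for identity, detected in pairs:
        for (lo, hi), tally in tallies.items():
            if lo <= identity < hi:
                tally[0] += 1
                tally[1] += int(detected)
                break
    # Bins with no homologous pairs are omitted rather than reported as zero.
    return {b: found / total for b, (total, found) in tallies.items() if total}
```

Applied to the full set of structurally identified pairs, a tally of this kind would give the filled and unfilled bar heights of a plot like Fig. 6.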

After completion of this work, a new version of pairwise
BLAST was released: BLASTGP (37). It supports gapped
alignments, like WU-BLAST2, and dispenses with sum statistics. Our
initial tests on BLASTGP using default parameters show that its
E-values are reliable and that its overall detection of homologs
was substantially better than that of ungapped BLAST, but not
quite equal to that of WU-BLAST2.

CONCLUSION 

The general consensus amongst experts (see refs. 7, 24, 25, 27 
and references therein) suggests that the most effective se- 
quence searches are made by (i) using a large current database 
in which the protein sequences have been complexity masked 
and (ii) using statistical scores to interpret the results. Our 
experiments fully support this view. 

Our results also suggest two further points. First, the E-val- 
ues reported by FASTA and SSEARCH give fairly accurate 
estimates of the significance of each match, but the P-values 
provided by BLAST and WU-BLAST2 underestimate the true 



Table 1. Summary of sequence comparison methods with PDB40D-B

Method                                  Relative Time*   1% EPQ Cutoff      Coverage at 1% EPQ, %
SSEARCH % identity: within alignment    25.5             >70%               <0.1
SSEARCH % identity: within both         25.5             34%                3.0
SSEARCH % identity: HSSP-scaled         25.5             35% (HSSP + 9.8)   4.0
SSEARCH Smith-Waterman raw scores       25.5             142                10.5
SSEARCH E-values                        25.5             0.03               18.4
FASTA ktup = 1 E-values                 3.9              0.03               17.9
FASTA ktup = 2 E-values                 1.4              0.03               16.7
WU-BLAST2 P-values                      1.1              0.003              17.5
BLAST P-values                          1.0              0.00016            14.8

*Times are from large database searches with genome proteins.






extent of errors. Second, SSEARCH, WU-BLAST2, and FASTA
ktup = 1 perform best, though BLAST and FASTA ktup = 2
detect most of the relationships found by the best procedures
and are appropriate for rapid initial searches.

The homologous proteins that are found by sequence
comparison can be distinguished with high reliability from the huge
number of unrelated pairs. However, even the best database
searching procedures tested fail to find the large majority of
distant evolutionary relationships at an acceptable error rate.
Thus, if the procedures assessed here fail to find a reliable
match, it does not imply that the sequence is unique; rather, it
indicates that any relatives it might have are distant ones.**

**Additional and updated information about this work, including
supplementary figures, may be found at http://sss.stanford.edu/sss/.



The authors are grateful to Drs. A. G. Murzin, M. Levitt, S. R. Eddy,
and G. Mitchison for valuable discussion. S.E.B. was principally
supported by a St. John's College (Cambridge, UK) Benefactors'
Scholarship and by the American Friends of Cambridge University.
S.E.B. dedicates his contribution to the memory of Rabbi Albert T.
and Clara S. Bilgray.

1. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman,
D. J. (1990) J. Mol. Biol. 215, 403-410.

2. Altschul, S. F. & Gish, W. (1996) Methods Enzymol. 266, 460-480.

3. Pearson, W. R. & Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA
85, 2444-2448.

4. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995)
J. Mol. Biol. 247, 536-540.

5. Brenner, S. E., Chothia, C., Hubbard, T. J. P. & Murzin, A. G.
(1996) Methods Enzymol. 266, 635-643.

6. Pearson, W. R. (1991) Genomics 11, 635-650.

7. Pearson, W. R. (1995) Protein Sci. 4, 1145-1160.

8. Smith, T. F. & Waterman, M. S. (1981) J. Mol. Biol. 147, 195-197.

9. George, D. G., Hunt, L. T. & Barker, W. C. (1996) Methods
Enzymol. 266, 41-59.

10. Vogt, G., Etzold, T. & Argos, P. (1995) J. Mol. Biol. 249, 816-831.

11. Henikoff, S. & Henikoff, J. G. (1993) Proteins 17, 49-61.

12. Bairoch, A. & Apweiler, R. (1996) Nucleic Acids Res. 24, 21-25.

13. Bairoch, A., Bucher, P. & Hofmann, K. (1996) Nucleic Acids Res.
24, 189-196.

14. Henikoff, S. & Henikoff, J. G. (1992) Proc. Natl. Acad. Sci. USA
89, 10915-10919.

15. Dayhoff, M., Schwartz, R. M. & Orcutt, B. C. (1978) in Atlas of
Protein Sequence and Structure, ed. Dayhoff, M. (National Biomedical
Research Foundation, Silver Spring, MD), Vol. 5, Suppl.
3, pp. 345-352.

16. Brenner, S. E. (1996) Ph.D. thesis (University of Cambridge,
UK).

17. Sander, C. & Schneider, R. (1991) Proteins 9, 56-63.

18. Johnson, M. S. & Overington, J. P. (1993) J. Mol. Biol. 233,
716-738.

19. Barton, G. J. & Sternberg, M. J. E. (1987) Protein Eng. 1, 89-94.

20. Lesk, A. M., Levitt, M. & Chothia, C. (1986) Protein Eng. 1,
77-78.

21. Arratia, R., Gordon, L. & Waterman, M. (1986) Ann. Stat. 14, 971-993.

22. Karlin, S. & Altschul, S. F. (1990) Proc. Natl. Acad. Sci. USA 87,
2264-2268.

23. Karlin, S. & Altschul, S. F. (1993) Proc. Natl. Acad. Sci. USA 90,
5873-5877.

24. Altschul, S. F., Boguski, M. S., Gish, W. & Wootton, J. C. (1994)
Nat. Genet. 6, 119-129.

25. Pearson, W. R. (1996) Methods Enzymol. 266, 227-258.

26. Lipman, D. J., Wilbur, W. J., Smith, T. F. & Waterman, M. S.
(1984) Nucleic Acids Res. 12, 215-226.

27. Wootton, J. C. & Federhen, S. (1996) Methods Enzymol. 266,
554-571.

28. Waterman, M. S. & Vingron, M. (1994) Stat. Science 9, 367-381.

29. Perutz, M. F., Kendrew, J. C. & Watson, H. C. (1965) J. Mol. Biol.
13, 669-678.

30. Abola, E. E., Bernstein, F. C., Bryant, S. H., Koetzle, T. F. &
Weng, J. (1987) in Crystallographic Databases: Information Content,
Software Systems, Scientific Applications, eds. Allen, F. H.,
Bergerhoff, G. & Sievers, R. (Data Comm. Intl. Union Crystallogr.,
Cambridge, UK), pp. 107-132.

31. Brenner, S. E., Chothia, C. & Hubbard, T. J. P. (1997) Curr. Opin.
Struct. Biol. 7, 369-376.

32. Orengo, C., Michie, A., Jones, S., Jones, D. T., Swindells, M. B. &
Thornton, J. (1997) Structure (London) 5, 1093-1108.

33. Zweig, M. H. & Campbell, G. (1993) Clin. Chem. 39, 561-577.

34. Gribskov, M. & Robinson, N. L. (1996) Comput. Chem. 20, 25-33.

35. Fitch, W. M. (1966) J. Mol. Biol. 16, 9-16.

36. Chung, S. Y. & Subbiah, S. (1996) Structure (London) 4, 1123-1127.

37. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang,
Z., Miller, W. & Lipman, D. J. (1997) Nucleic Acids Res. 25,
3389-3402.

38. Girling, R., Schmidt, W., Jr., Houston, T., Amma, E. & Huisman,
T. (1979) J. Mol. Biol. 131, 417-433.

39. Spezio, M., Wilson, D. & Karplus, P. (1993) Biochemistry 32,
9906-9916.

40. Sayle, R. A. & Milner-White, E. J. (1995) Trends Biochem. Sci.
20, 374-376.



Docket No.: PF-0162-3 DIV
USSN: 09/784,739
Ref. No. 1 of 2



Proc. Natl. Acad. Sci. USA

Vol. 95, pp. 6073-6078, May 1998 

Biochemistry 

Assessing sequence comparison methods with reliable structurally 
identified distant evolutionary relationships 

Steven E. Brenner*†‡, Cyrus Chothia*, and Tim J. P. Hubbard§

*MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, United Kingdom; and §Sanger Centre, Wellcome Trust Genome Campus, Hinxton,
Cambs CB10 1SA, United Kingdom

Communicated by David R. Davies, National Institute of Diabetes, Bethesda, MD, March 16, 1998 (received for review November 12, 1997)



ABSTRACT Pairwise sequence comparison methods have
been assessed using proteins whose relationships are known
reliably from their structures and functions, as described in
the SCOP database [Murzin, A. G., Brenner, S. E., Hubbard, T.
& Chothia, C. (1995) J. Mol. Biol. 247, 536-540]. The evaluation
tested the programs BLAST [Altschul, S. F., Gish, W.,
Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol.
215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996)
Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. &
Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448],
and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol.
Biol. 147, 195-197] and their scoring schemes. The error rate
of all algorithms is greatly reduced by using statistical scores
to evaluate matches rather than percentage identity or raw
scores. The E-value statistical scores of SSEARCH and FASTA are
reliable: the number of false positives found in our tests agrees
well with the scores reported. However, the P-values reported
by BLAST and WU-BLAST2 exaggerate significance by orders of
magnitude. SSEARCH, FASTA ktup = 1, and WU-BLAST2 perform
best, and they are capable of detecting almost all relationships
between proteins whose sequence identities are >30%. For
more distantly related proteins, they do much less well; only
one-half of the relationships between proteins with 20-30%
identity are found. Because many homologs have low sequence
similarity, most distant relationships cannot be detected by
any pairwise comparison method; however, those which are
identified may be used with confidence.



Sequence database searching plays a role in virtually every 
branch of molecular biology and is crucial for interpreting the 
sequences issuing forth from genome projects. Given the 
method's central role, it is surprising that overall and relative 
capabilities of different procedures are largely unknown. It is 
difficult to verify algorithms on sample data because this 
requires large data sets of proteins whose evolutionary rela- 
tionships are known unambiguously and independently of the 
methods being evaluated. However, nearly all known homologs
have been identified by sequence analysis (the method
to be tested). Also, it is generally very difficult to know, in the 
absence of structural data, whether two proteins that lack clear 
sequence similarity are unrelated. This has meant that al- 
though previous evaluations have helped improve sequence 
comparison, they have suffered from insufficient, imperfectly 
characterized, or artificial test data. Assessment also has been 
problematic because high quality database sequence searching 
attempts to have both sensitivity (detection of homologs) and 
specificity (rejection of unrelated proteins); however, these 
complementary goals are linked such that increasing one 
causes the other to be reduced. 



The publication costs of this article were defrayed in part by page charge 
payment. This article must therefore be hereby marked "advertisement" in
accordance with 18 U.S.C. §1734 solely to indicate this fact. 

© 1998 by The National Academy of Sciences 0027-8424/98/956073-6$ 2.00/0 
PNAS is available online at http://www.pnas.org. 



Sequence comparison methodologies have evolved rapidly,
so no previously published tests have evaluated modern versions
of programs commonly used. For example, parameters in
BLAST (1) have changed, and WU-BLAST2 (2), which produces
gapped alignments, has become available. The latest version
of FASTA (3) previously tested was 1.6, but the current release
(version 3.0) provides fundamentally different results in the
form of statistical scoring.

The previous reports also have left gaps in our knowledge.
For example, there has been no published assessment of
thresholds for scoring schemes more sophisticated than percentage
identity. Thus, the widely discussed statistical scoring
measures have never actually been evaluated on large databases
of real proteins. Moreover, the different scoring schemes
commonly in use have not been compared.

Beyond these issues, there is a more fundamental question:
in an absolute sense, how well does pairwise sequence comparison
work? That is, what fraction of homologous proteins
can be detected using modern database searching methods?

In this work, we attempt to answer these questions and to
overcome both of the fundamental difficulties that have hindered
assessment of sequence comparison methodologies.
First, we use the set of distant evolutionary relationships in the
SCOP: Structural Classification of Proteins database (4), which
is derived from structural and functional characteristics (5).
The SCOP database provides a uniquely reliable set of homologs,
which are known independently of sequence comparison.
Second, we use an assessment method that jointly measures
both sensitivity and specificity. This method allows
straightforward comparison of different sequence searching
procedures. Further, it can be used to aid interpretation of real
database searches and thus provide optimal and reliable
results.

Previous Assessments of Sequence Comparison. Several
previous studies have examined the relative performance of
different sequence comparison methods. The most encompassing
analyses have been by Pearson (6, 7), who compared
the three most commonly used programs. Of these, the
Smith-Waterman algorithm (8) implemented in SSEARCH (3) is the
oldest and slowest but the most rigorous. Modern heuristics
have provided BLAST (1) the speed and convenience to make
it the most popular program. Intermediate between these two
is FASTA (3), which may be run in two modes offering either
greater speed (ktup = 2) or greater effectiveness (ktup = 1).
Pearson also considered different parameters for each of these
programs.

To test the methods, Pearson selected two representative 
proteins from each of 67 protein superfamilies defined by the 
PIR database (9). Each was used as a query to search the 
database, and the matched proteins were marked as being 
homologous or unrelated according to their membership of PIR 



Abbreviation: EPQ, errors per query. 

†Present address: Department of Structural Biology, Stanford University,
Fairchild Building D-109, Stanford, CA 94305-5126.

‡To whom reprint requests should be addressed. e-mail:
brenner@hyper.stanford.edu.






superfamilies. Pearson found that modern matrices and
"ln-scaling" of raw scores improve results considerably. He also
reported that the rigorous Smith-Waterman algorithm worked
slightly better than FASTA, which was in turn more effective
than BLAST.

Very large scale analyses of matrices have been performed
(10), and Henikoff and Henikoff (11) also evaluated the
effectiveness of BLAST and FASTA. Their test with BLAST
considered the ability to detect homologs above a predetermined
score but had no penalty for methods which also
reported large numbers of spurious matches. The Henikoffs
searched the SWISS-PROT database (12) and used PROSITE (13)
to define homologous families. Their results showed that the
BLOSUM62 matrix (14) performed markedly better than the
extrapolated PAM-series matrices (15), which previously had
been popular.

A crucial aspect of any assessment is the data that are used
to test the ability of the program to find homologs. But in
Pearson's and the Henikoffs' evaluations of sequence comparison,
the correct results were effectively unknown. This is
because the superfamilies in PIR and PROSITE are principally
created by using the same sequence comparison methods
which are being evaluated. Interdependency of data and
methods creates a "chicken and egg" problem, and means, for
example, that new methods would be penalized for correctly
identifying homologs missed by older programs. For instance,
immunoglobulin variable and constant domains are clearly
homologous, but PIR places them in different superfamilies.
The problem is widespread: each superfamily in PIR 48.00 with
a structural homolog is itself homologous to an average of 1.6
other PIR superfamilies (16).

To surmount these sorts of difficulties, Sander and Schneider
(17) used protein structures to evaluate sequence comparison.
Rather than comparing different sequence comparison
algorithms, their work focused on determining a length-dependent
threshold of percentage identity, above which all
proteins would be of similar structure. A result of this analysis
was the HSSP equation; it states that proteins with 25% identity
over 80 residues will have similar structures, whereas shorter
alignments require higher identity. (Other studies also have
used structures (18-20), but these focused on a small number
of model proteins and were principally oriented toward evaluating
alignment accuracy rather than homology detection.)

A general solution to the problem of scoring comes from
statistical measures (i.e., E-values and P-values) based on the
extreme value distribution (21). Extreme value scoring was
implemented analytically in the BLAST program using the
Karlin and Altschul statistics (22, 23), and empirical approaches
have been recently added to FASTA and SSEARCH. In
addition to being heralded as a reliable means of recognizing
significantly similar proteins (24, 25), the mathematical tractability
of statistical scores "is a crucial feature of the BLAST
algorithm" (1). The validity of this scoring procedure has been
tested analytically and empirically (see ref. 2 and references in
ref. 24). However, all large empirical tests used random
sequences that may lack the subtle structure found within
biological sequences (26, 27) and obviously do not contain any
real homologs. Thus, although many researchers have suggested
that statistical scores be used to rank matches (24, 25,
28), there have been no large rigorous experiments on biological
data to determine the degree to which such rankings are
superior.

A Database for Testing Homology Detection. Since the
discovery that the structures of hemoglobin and myoglobin are
very similar though their sequences are not (29), it has been
apparent that comparing structures is a more powerful (if less
convenient) way to recognize distant evolutionary relationships
than comparing sequences. If two proteins show a high
degree of similarity in their structural details and function, it
is very probable that they have an evolutionary relationship
though their sequence similarity may be low.

The recent growth of protein structure information combined
with the comprehensive evolutionary classification in
the SCOP database (4, 5) have allowed us to overcome previous
limitations. With these data, we can evaluate the performance
of sequence comparison methods on real protein sequences
whose relationships are known confidently. The SCOP database
uses structural information to recognize distant homologs, the
large majority of which can be determined unambiguously.
These superfamilies, such as the globins or the immunoglobulins,
would be recognized as related by the vast majority of the
biological community despite the lack of high sequence similarity.

From SCOP, we extracted the sequences of domains of
proteins in the Protein Data Bank (PDB) (30) and created two
databases. One (PDB90D-B) has domains which were all <90%
identical to any other, whereas (PDB40D-B) had those <40%
identical. The databases were created by first sorting all
protein domains in SCOP by their quality and making a list. The
highest quality domain was selected for inclusion in the
database and removed from the list. Also removed from the list
(and discarded) were all other domains above the threshold
level of identity to the selected domain. This process was
repeated until the list was empty. The PDB40D-B database
contains 1,323 domains, which have 9,044 ordered pairs of
distant relationships, or ≈0.5% of the total 1,749,006 ordered
pairs. In PDB90D-B, the 2,079 domains have 53,988 relationships,
representing 1.2% of all pairs. Low complexity regions
of sequence can achieve spurious high scores, so these were
masked in both databases by processing with the SEG program
(27) using recommended parameters: 12 1.8 2.0. The databases
used in this paper are available from http://sss.stanford.edu/sss/,
and databases derived from the current version of SCOP
may be found at http://scop.mrc-lmb.cam.ac.uk/scop/.
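
The selection procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the quality values, the `identity` callback, and the function names are hypothetical and are not taken from the paper, which does not say how quality or pairwise identity was computed.

```python
from typing import Callable, List, Sequence, Tuple


def select_representatives(
    domains: Sequence[Tuple[str, float]],
    identity: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Greedily pick domains so that no two kept domains reach `threshold` identity.

    domains   -- (name, quality) pairs; higher quality is better (assumed scale)
    identity  -- hypothetical callback returning percent identity of two domains
    threshold -- percent identity at or above which a domain is discarded
    """
    # Sort once by quality, best first.
    remaining = [name for name, _ in sorted(domains, key=lambda d: d[1], reverse=True)]
    selected: List[str] = []
    while remaining:
        best = remaining.pop(0)  # highest-quality domain still on the list
        selected.append(best)
        # Discard everything too similar to the domain just selected.
        remaining = [d for d in remaining if identity(best, d) < threshold]
    return selected


def ordered_pairs(n: int) -> int:
    """Number of ordered (query, target) pairs among n sequences, excluding self-pairs."""
    return n * (n - 1)


if __name__ == "__main__":
    # Toy data: these identities are invented for illustration only.
    toy = {frozenset("ab"): 95.0, frozenset("ac"): 30.0, frozenset("bc"): 35.0}
    ident = lambda x, y: toy[frozenset((x, y))]
    print(select_representatives([("a", 2.0), ("b", 1.0), ("c", 1.5)], ident, 90.0))
    print(ordered_pairs(1323))         # total ordered pairs for 1,323 domains
    print(9044 / ordered_pairs(1323))  # fraction of ordered pairs that are homologous
```

With 1,323 domains, `ordered_pairs` gives the 1,749,006 total quoted in the text, and 9,044 homologous pairs are about 0.5% of that total.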

Analyses from both databases were generally consistent, but
PDB40D-B focuses on distantly related proteins and reduces the
heavy overrepresentation in the PDB of a small number of
families (31, 32), whereas PDB90D-B (with more sequences)
improves evaluations of statistics. Except where noted otherwise,
the distant homolog results here are from PDB40D-B.
Although the precise numbers reported here are specific to the
structural domain databases used, we expect the trends to be
general.

Assessment Data and Procedure. Our assessment of sequence
comparison may be divided into four different major
categories of tests. First, using just a single sequence comparison
algorithm at a time, we evaluated the effectiveness of
different scoring schemes. Second, we assessed the reliability
of scoring procedures, including an evaluation of the validity
of statistical scoring. Third, we compared sequence comparison
algorithms (using the optimal scoring scheme) to determine
their relative performance. Fourth, we examined the
distribution of homologs and considered the power of pairwise
sequence comparison to recognize them. All of the analyses
used the databases of structurally identified homologs and a
new assessment criterion.

The analyses tested BLAST (1), version 1.4.9MP, and
WU-BLAST2 (2), version 2.0a13MP. Also assessed was the FASTA
package, version 3.0t76 (3), which provided FASTA and the
SSEARCH implementation of Smith-Waterman (8). For
SSEARCH and FASTA, we used BLOSUM45 with gap penalties
-12/-1 (7, 16). The default parameters and matrix (BLOSUM62)
were used for BLAST and WU-BLAST2.

The "Coverage Vs. Error" Plot. To test a particular protocol 
(comprising a program and scoring scheme), each sequence 
from the database was used as a query to search the database. 
This yielded ordered pairs of query and target sequences with 
associated scores, which were sorted, on the basis of their 
scores, from best to worst. The ideal method would have 






[Fig. 1 plots: (A) Smith-Waterman Scoring Schemes (PDB40D-B); (B) Smith-Waterman Scoring Schemes (PDB90D-B). x axis: Coverage.]

Fig. 1. Coverage vs. error plots of different scoring schemes for SSEARCH Smith-Waterman. (A) Analysis of PDB40D-B database. (B) Analysis
of PDB90D-B database. All of the proteins in the database were compared with each other using the SSEARCH program. The results of this single
set of comparisons were considered using five different scoring schemes and assessed. The graphs show the coverage and errors per query (EPQ)
for statistical scores, raw scores, and three measures using percentage identity. In the coverage vs. error plot, the x axis indicates the fraction of
all homologs in the database (known from structure) which have been detected. Precisely, it is the number of detected pairs of proteins with the
same fold divided by the total number of pairs from a common superfamily. PDB40D-B contains a total of 9,044 homologs, so a score of 10% indicates
identification of 904 relationships. The y axis reports the number of EPQ. Because there are 1,323 queries made in the PDB40D-B all-vs.-all
comparison, 13 errors corresponds to 0.01, or 1% EPQ. The y axis is presented on a log scale to show results over the widely varying degrees of
accuracy which may be desired. The scores that correspond to the levels of EPQ and coverage are shown in Fig. 4 and Table 1. The graph
demonstrates the trade-off between sensitivity and selectivity. As more homologs are found (moving to the right), more errors are made (moving
up). The ideal method would be in the lower right corner of the graph, which corresponds to identifying many evolutionary relationships without
selecting unrelated proteins. Three measures of percentage identity are plotted. Percentage identity within alignment is the degree of identity within
the aligned region of the proteins, without consideration of the alignment length. Percentage identity within both is the number of identical residues
in the aligned region as a percentage of the average length of the query and target proteins. The HSSP equation (17) is $H = 290.15\,l^{-0.562}$, where
$l$ is length, for $10 < l < 80$; $H > 100$ for $l < 10$; $H = 24.7$ for $l > 80$. The percentage identity HSSP-adjusted score is the percent identity within
the alignment minus H. Smith-Waterman raw scores and E-values were taken directly from the sequence comparison program.



perfect separation, with all of the homologs at the top of the 
list and unrelated proteins below. In practice, perfect separa- 
tion is impossible to achieve so instead one is interested in 
drawing a threshold above which there are the largest number 
of related pairs of sequences consistent with an acceptable 
error rate. 

Our procedure involved measuring the coverage and error
for every threshold. Coverage was defined as the fraction of
structurally determined homologs that have scores above the
selected threshold; this reflects the sensitivity of a method.
Errors per query (EPQ), an indicator of selectivity, is the
number of nonhomologous pairs above the threshold divided
by the number of queries. Graphs of these data, called
coverage vs. error plots, were devised to understand how
protocols compare at different levels of accuracy. These
graphs share effectively all of the beneficial features of Receiver
Operating Characteristic (ROC) plots (33, 34) but
better represent the high degrees of accuracy required in
sequence comparison and the huge background of nonhomologs.
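
The coverage and EPQ definitions above translate directly into code. The sketch below is a minimal illustration, not the authors' software: the function names, the `(score, is_homolog)` input layout, and the handling of tied scores are assumptions made for this example.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float, float]  # (score threshold, coverage, errors per query)


def coverage_vs_error(
    hits: Iterable[Tuple[float, bool]],
    n_homolog_pairs: int,
    n_queries: int,
    higher_is_better: bool = True,
) -> List[Point]:
    """Sweep a threshold over all scores and report coverage and EPQ at each one.

    hits             -- (score, is_homolog) for every ordered query/target pair reported
    n_homolog_pairs  -- total ordered homologous pairs known from structure
    n_queries        -- number of query sequences used
    higher_is_better -- set False for E-values or P-values, where small is significant
    """
    ranked = sorted(hits, key=lambda h: h[0], reverse=higher_is_better)
    curve: List[Point] = []
    true_found = 0
    false_found = 0
    for i, (score, is_homolog) in enumerate(ranked):
        if is_homolog:
            true_found += 1
        else:
            false_found += 1
        # Emit a point only after the last hit sharing this score, so tied
        # scores always fall on the same side of a threshold.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            curve.append((score, true_found / n_homolog_pairs, false_found / n_queries))
    return curve


def coverage_at_epq(curve: List[Point], max_epq: float) -> float:
    """Best coverage reachable without exceeding `max_epq` errors per query."""
    acceptable = [coverage for _, coverage, epq in curve if epq <= max_epq]
    return max(acceptable) if acceptable else 0.0


if __name__ == "__main__":
    # Toy raw scores, invented for illustration: 4 known homologous pairs, 2 queries.
    toy_hits = [(50.0, True), (40.0, True), (30.0, False), (20.0, True), (10.0, False)]
    for point in coverage_vs_error(toy_hits, n_homolog_pairs=4, n_queries=2):
        print(point)
```

For the PDB40D-B numbers given in the text, 13 nonhomologous pairs above a threshold in 1,323 queries is 13/1323, or roughly 1% EPQ.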

This assessment procedure is directly relevant to practical 
sequence database searching, for it provides precisely the 
information necessary to perform a reliable sequence database 
search. The EPQ measure places a premium on score consis- 
tency; that is, it requires scores to be comparable for different 
queries. Consistency is an aspect which has been largely 



[Fig. 3 plot: Percent Identity of Unrelated Proteins (PDB90D-B). Annotation: each point plots the length and percent identity of an alignment between two unrelated proteins.]

Hemoglobin β-chain (1hdsb) Cellulase E2 (1tml_)

1hdsb 0KVDVDVyGAQAU3R--LLVVYPWTQRFFQ'HFGNI£SAGAVMra^

1tml_ GOVDALMSAAQAMKIPILWYNAPGR— -DCGNKSSGGA PSKSAY-RSWIDEFAAGLKN

Fig. 2. Unrelated proteins with high percentage identity. Hemoglobin
β-chain (PDB code 1hds chain b, ref. 38, Left) and cellulase E2
(PDB code 1tml, ref. 39, Right) have 39% identity over 64 residues, a
level which is often believed to be indicative of homology. Despite this
high degree of identity, their structures strongly suggest that these
proteins are not related. Appropriately, neither the raw alignment
score of 85 nor the E-value of 1.3 is significant. Proteins rendered by
RASMOL (40).




[Fig. 3 x axis: Alignment length.]



Fig. 3. Length and percentage identity of alignments of unrelated
proteins in PDB90D-B. Each pair of nonhomologous proteins found with
SSEARCH is plotted as a point whose position indicates the length and
the percentage identity within the alignment. Because alignment
length and percentage identity are quantized, many pairs of proteins
may have exactly the same alignment length and percentage identity.
The line shows the HSSP threshold (though it is intended to be applied
with a different matrix and parameters).




[Fig. 4 plot: Reliability of Statistical Scores (PDB90D-B). x axis: Errors Per Query.]



Fig. 4. Reliability of statistical scores in PDB90D-B. Each line shows
the relationship between reported statistical score and actual error
rate for a different program. E-values are reported for SSEARCH and
FASTA, whereas P-values are shown for BLAST and WU-BLAST2. If the
scoring were perfect, then the number of errors per query and the
E-values would be the same, as indicated by the upper bold line.
(P-values should be the same as EPQ for small numbers, and diverge
at higher values, as indicated by the lower bold line.) E-values from
SSEARCH and FASTA are shown to have good agreement with EPQ but
underestimate the significance slightly. BLAST and WU-BLAST2 are
overconfident, with the degree of exaggeration dependent upon the
score. The results for PDB40D-B were similar to those for PDB90D-B
despite the difference in number of homologs detected. This graph
could be used to roughly calibrate the reliability of a given statistical
score.

ignored in previous tests but is essential for the straightforward 
or automatic interpretation of sequence comparison results. 
Further, it provides a clear indication of the confidence that 
should be ascribed to each match. Indeed, the EPQ measure 
should approximate the expectation value reported by data- 
base searching programs, if the programs' estimates are accu- 
rate. 
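
Because some of the programs tested report P-values rather than expectation values, it can help to see how the two are usually related. The sketch below assumes the standard Poisson relation P = 1 - exp(-E), which is not stated in this paper; under that assumption, P and E are nearly equal for small values and diverge as E approaches 1. The function name is hypothetical.

```python
import math


def p_value_from_e_value(e_value: float) -> float:
    """Convert an expectation value to a P-value, assuming Poisson-distributed chance hits.

    Uses P = 1 - exp(-E); expm1 keeps precision when E is very small.
    """
    return -math.expm1(-e_value)


if __name__ == "__main__":
    for e in (0.001, 0.01, 0.1, 1.0, 10.0):
        print(f"E = {e:<6} P = {p_value_from_e_value(e):.6f}")
```

Under this assumption a P-value can never exceed 1, whereas an E-value (and EPQ) can, which is why the two kinds of score need separate reference lines when compared with EPQ.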

The Performance of Scoring Schemes. All of the programs 
tested could provide three fundamental types of scores. The 
first score is the percentage identity, which may be computed
in several ways based on either the length of the alignment or
the lengths of the sequences. The second is a "raw" or
"Smith-Waterman" score, which is the measure optimized by 
the Smith-Waterman algorithm and is computed by summing 
the substitution matrix scores for each position in the align- 
ment and subtracting gap penalties. In BLAST, a measure 



[Fig. 5A plot: Sequence Comparison Algorithms (PDB40D-B). x axis: Coverage.]




related to this score is scaled into bits. Third is a statistical
score based on the extreme value distribution. These results 
are summarized in Fig. 1. 
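
The raw score described above (summed substitution scores minus gap penalties) can be computed for a given alignment as sketched below. Several things here are assumptions for illustration only: the toy match/mismatch matrix stands in for a real matrix such as BLOSUM45, and the gap penalties -12/-1 are read as 12 for the first residue of a gap and 1 for each additional residue, which is one common convention but is not spelled out in the text.

```python
from typing import Dict, Tuple


def toy_matrix(alphabet: str = "ACDE", match: int = 5, mismatch: int = -2) -> Dict[Tuple[str, str], int]:
    """Hypothetical substitution matrix: one score for identities, one for everything else."""
    return {(a, b): (match if a == b else mismatch) for a in alphabet for b in alphabet}


def raw_score(
    aligned_query: str,
    aligned_target: str,
    matrix: Dict[Tuple[str, str], int],
    gap_first: int = 12,
    gap_extend: int = 1,
) -> int:
    """Sum substitution scores over aligned columns and subtract gap penalties.

    Both strings must have equal length, with '-' marking gap positions.
    """
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_side = None  # which sequence the current gap is in, if any
    for q, t in zip(aligned_query, aligned_target):
        side = "query" if q == "-" else "target" if t == "-" else None
        if side is None:
            score += matrix[(q, t)]  # aligned residue pair
        elif side == gap_side:
            score -= gap_extend      # continuing an existing gap
        else:
            score -= gap_first       # first residue of a new gap
        gap_side = side
    return score


if __name__ == "__main__":
    # Four identities (+20) and one gap of length 2 (-12 - 1).
    print(raw_score("ACD--E", "ACDAAE", toy_matrix()))
```

The sketch scores an alignment that is already given; finding the alignment that maximizes this score is the job of the Smith-Waterman algorithm itself.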

Sequence Identity. Though it has been long established that 
percentage identity is a poor measure (35), there is a common 
rule-of-thumb stating that 30% identity signifies homology. 
Moreover, publications have indicated that 25% identity can 
be used as a threshold (17, 36). We find that these thresholds,
originally derived years ago, are not supported by present 
results. As databases have grown, so have the possibilities for 
chance alignments with high identity; thus, the reported cutoffs 
lead to frequent errors. Fig. 2 shows one of the many pairs of 
proteins with very different structures that nonetheless have 
high levels of identity over considerable aligned regions. 
Despite the high identity, the raw and the statistical scores for
such incorrect matches are typically not significant. The prin- 
cipal reasons percentage identity does so poorly seem to be 
that it ignores information about gaps and about the conser- 
vative or radical nature of residue substitutions. 

From the PDB90D-B analysis in Fig. 3, we learn that 30%
identity is a reliable threshold for this database only for
sequence alignments of at least 150 residues. Because one
unrelated pair of proteins has 43.5% identity over 62 residues,
it is probably necessary for alignments to be at least 70 residues
in length before 40% is a reasonable threshold, for a database
of this particular size and composition.

At a given reliability, scores based on percentage identity
detect just a fraction of the distant homologs found by
statistical scoring. If one measures the percentage identity in
the aligned regions without consideration of alignment length,
then a negligible number of distant homologs are detected.
Use of the HSSP equation improves the value of percentage
identity, but even this measure can find only 4% of all known
homologs at 1% EPQ. In short, percentage identity discards
most of the information measured in a sequence comparison.
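
The three percentage-identity measures compared in this study can be sketched as follows, using the HSSP equation as quoted in the Fig. 1 legend (H = 290.15 l^-0.562 for 10 < l < 80, H = 24.7 for l > 80). The function names are hypothetical, and the treatment of the boundary lengths 10 and 80 and of very short alignments is an assumption, since the legend does not define them exactly.

```python
import math


def hssp_threshold(length: int) -> float:
    """Length-dependent percent-identity threshold H for an alignment of `length` residues."""
    if length < 10:
        # The legend only says H > 100 here; infinity encodes "can never pass".
        return math.inf
    if length >= 80:
        return 24.7
    return 290.15 * length ** -0.562


def identity_within_alignment(n_identical: int, alignment_length: int) -> float:
    """Percent identity over the aligned region only."""
    return 100.0 * n_identical / alignment_length


def identity_within_both(n_identical: int, query_length: int, target_length: int) -> float:
    """Identical residues as a percentage of the average full sequence length."""
    return 100.0 * n_identical / ((query_length + target_length) / 2.0)


def hssp_adjusted_score(n_identical: int, alignment_length: int) -> float:
    """Percent identity within the alignment minus the HSSP threshold H."""
    return identity_within_alignment(n_identical, alignment_length) - hssp_threshold(alignment_length)


if __name__ == "__main__":
    # 25 identities over 64 aligned residues is about 39%, as for the Fig. 2 pair.
    print(round(hssp_threshold(64), 1))
    print(round(hssp_adjusted_score(25, 64), 1))
```

For an alignment like the one in Fig. 2 (about 39% identity over 64 residues), the threshold from this formula is roughly 28%, so the HSSP-adjusted score is positive even though the proteins are structurally unrelated. This is only a rough illustration, since the HSSP threshold was intended for a different matrix and parameters.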

Raw Scores. Smith-Waterman raw scores perform better
than percentage identity (Fig. 1), but ln-scaling (7) provided no
notable benefit in our analysis. It is necessary to be very precise
when using either raw or bit scores because a 20% change in
cutoff score could yield a tenfold difference in EPQ. However,
it is difficult to choose appropriate thresholds because the
reliability of a bit score depends on the lengths of the proteins
matched and the size of the database. Raw score thresholds
also are affected by matrix and gap parameters.

Statistical Scores. Statistical scores were introduced partly 
to overcome the problems that arise from raw scores. This 
scoring scheme provides the best discrimination between 
homologous proteins and those which are unrelated. Most 



[Fig. 5B plot: Sequence Comparison Algorithms (PDB90D-B). x axis: Coverage.]



Fig. 5. Coverage vs. error plots of different sequence comparison methods. Five different sequence comparison methods are evaluated, each
using statistical scores (E- or P-values). (A) PDB40D-B database. In this analysis, the best method is the slow SSEARCH, which finds 18% of relationships
at 1% EPQ. FASTA ktup = 1 and WU-BLAST2 are almost as good. (B) PDB90D-B database. The quick WU-BLAST2 program provides the best coverage
at 1% EPQ on this database, although at higher levels of error it becomes slightly worse than FASTA ktup = 1 and SSEARCH.






likely, its power can be attributed to its incorporation of more 
information than any other measure; it takes account of the 
full substitution and gap data (like raw scores) but also has
details about the sequence lengths and composition and is 
scaled appropriately. 

We find that statistical scores are not only powerful, but also
easy to interpret. SSEARCH and FASTA show close agreement
between statistical scores and actual number of errors per
query (Fig. 4). The expectation value score gives a good,
slightly conservative estimate of the chances of the two sequences
being found at random in a given query. Thus, an
E-value of 0.01 indicates that roughly one pair of nonhomologs
of this similarity should be found in every 100 different queries.
Neither raw scores nor percentage identity can be interpreted
in this way, and these results validate the suitability of the
extreme value distribution for describing the scores from a
database search.

The P-values from BLAST also should be directly interpretable
but were found to overstate significance by more than two
orders of magnitude for 1% EPQ for this database. Nonetheless,
these results strongly suggest that the analytic theory is
fundamentally appropriate. WU-BLAST2 scores were more reliable
than those from BLAST, but also exaggerate expected
confidence by more than an order of magnitude at 1% EPQ.
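
The kind of calibration check described here can be sketched as follows: count the nonhomologous pairs at or below a reported statistical score, divide by the number of queries, and compare the result with the score itself. This is an illustrative sketch only; the function names and the toy hit list are hypothetical and do not reproduce the paper's data.

```python
import math
from typing import Iterable, Tuple


def observed_epq(hits: Iterable[Tuple[float, bool]], threshold: float, n_queries: int) -> float:
    """Errors per query at a statistical-score threshold.

    hits -- (reported E-value or P-value, is_homolog); smaller means more significant
    """
    errors = sum(1 for score, is_homolog in hits if score <= threshold and not is_homolog)
    return errors / n_queries


def orders_of_exaggeration(reported_score: float, observed: float) -> float:
    """log10 of observed EPQ over reported score; positive values mean overstated significance."""
    if reported_score <= 0 or observed <= 0:
        raise ValueError("both values must be positive")
    return math.log10(observed / reported_score)


if __name__ == "__main__":
    # Toy hit list, invented for illustration: one false hit at a reported score of 1e-4.
    toy_hits = [(1e-6, True), (1e-4, False), (1e-3, True), (0.5, False)]
    epq = observed_epq(toy_hits, threshold=1e-4, n_queries=100)
    print(epq, orders_of_exaggeration(1e-4, epq))
```

In this toy case a reported score of 1e-4 coincides with an observed EPQ of 0.01, i.e. the score overstates significance by about two orders of magnitude; a well-calibrated E-value would give a value near zero.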

Overall Detection of Homologs and Comparison of Algorithms.
The results in Fig. 5A and Table 1 show that pairwise
sequence comparison is capable of identifying only a small
fraction of the homologous pairs of sequences in PDB40D-B.
Even SSEARCH with E-values, the best protocol tested, could
find only 18% of all relationships at a 1% EPQ. BLAST, which
identifies 15%, was the worst performer, whereas FASTA
ktup = 1 is nearly as effective as SSEARCH. FASTA ktup = 2 and
WU-BLAST2 are intermediate in their ability to detect homologs.
Comparison of different algorithms indicates that
those capable of identifying more homologs are generally
slower. SSEARCH is 25 times slower than BLAST and 6.5 times
slower than FASTA ktup = 1. WU-BLAST2 is slightly faster than
FASTA ktup = 2, but the latter has more interpretable scores.

In PDB90D-B, where there are many close relationships, the
best method can identify only 38% of structurally known
homologs (Fig. 5B). The method which finds that many
relationships is WU-BLAST2. Consequently, we infer that the
differences between FASTA ktup = 1, SSEARCH, and WU-BLAST2
programs are unlikely to be significant when compared with
variation in database composition and scoring reliability.

Fig. 6 helps to explain why most distant homologs cannot be
found by sequence comparison: a great many such relationships
have no more sequence identity than would be expected
by chance. SSEARCH with E-values can recognize >90% of the
homologous pairs with 30-40% identity. In this region, there
are 30 pairs of homologous proteins that do not have significant
E-values, but 26 of these involve sequences with <50
residues. Of sequences having 25-30% identity, 75% are
identified by SSEARCH E-values. However, although the number
of homologs grows at lower levels of identity, the detection
falls off sharply: only 40% of homologs with 20-25% identity



[Fig. 6 plot: Distribution and Detection of Homologs (PDB40D-B). x axis: percentage identity (in both). Series: total number of homologs in database; homologs detected by SSEARCH E-values at 1% EPQ.]

Fig. 6. Distribution and detection of homologs in PDB40D-B. Bars
show the distribution of homologous pairs in PDB40D-B according to their
identity (using the measure of identity in both). Filled regions indicate
the number of these pairs found by the best database searching method
(SSEARCH with E-values) at 1% EPQ. The PDB40D-B database contains
proteins with <40% identity, and as shown on this graph, most
structurally identified homologs in the database have diverged
extremely far in sequence and have <20% identity. Note that the
alignments may be inaccurate, especially at low levels of identity. Filled
regions show that SSEARCH can identify most relationships that have
25% or more identity, but its detection wanes sharply below 25%.
Consequently, the great sequence divergence of most structurally
identified evolutionary relationships effectively defeats the ability of
pairwise sequence comparison to detect them.

are detected and only 10% of those with 15-20% can be found.
These results show that statistical scores can find related
proteins whose identity is remarkably low; however, the power
of the method is restricted by the great divergence of many
protein sequences.

After completion of this work, a new version of pairwise 
BLAST was released: BLASTGP (37). It supports gapped align- 
ments, like WU-BLAST2, and dispenses with sum statistics. Our 
initial tests on BLASTGP using default parameters show that its 
E-values are reliable and that its overall detection of homologs 
was substantially better than that of ungapped BLAST, but not 
quite equal to that of WU-BLAST2. 

CONCLUSION 

The general consensus amongst experts (see refs. 7, 24, 25, 27
and references therein) suggests that the most effective
sequence searches are made by (i) using a large current database
in which the protein sequences have been complexity masked
and (ii) using statistical scores to interpret the results. Our
experiments fully support this view.

Our results also suggest two further points. First, the
E-values reported by FASTA and SSEARCH give fairly accurate
estimates of the significance of each match, but the P-values
provided by BLAST and WU-BLAST2 underestimate the true



Table 1. Summary of sequence comparison methods with PDB40D-B

Method                                 Relative Time*   1% EPQ Cutoff      Coverage at 1% EPQ (%)
SSEARCH % identity: within alignment   25.5             >70%               <0.1
SSEARCH % identity: within both        25.5             34%                3.0
SSEARCH % identity: HSSP-scaled        25.5             35% (HSSP + 9.8)   4.0
SSEARCH Smith-Waterman raw scores      25.5             142                10.5
SSEARCH E-values                       25.5             0.03               18.4
FASTA ktup = 1 E-values                3.9              0.03               17.9
FASTA ktup = 2 E-values                1.4              0.03               16.7
WU-BLAST2 P-values                     1.1              0.003              17.5
BLAST P-values                         1.0              0.00016            14.8

*Times are from large database searches with genome proteins.



6078 Biochemistry: Brenner et al.

Proc. Natl. Acad. Sci. USA 95 (1998)



extent of errors. Second, SSEARCH, WU-BLAST2, and FASTA
ktup = 1 perform best, though BLAST and FASTA ktup = 2
detect most of the relationships found by the best procedures
and are appropriate for rapid initial searches.
The homologous proteins that are found by sequence
comparison can be distinguished with high reliability from the huge
number of unrelated pairs. However, even the best database
searching procedures tested fail to find the large majority of
distant evolutionary relationships at an acceptable error rate.
Thus, if the procedures assessed here fail to find a reliable
match, it does not imply that the sequence is unique; rather, it
indicates that any relatives it might have are distant ones.**

**Additional and updated information about this work, including
supplementary figures, may be found at http://sss.stanford.edu/sss/.



The authors are grateful to Drs. A. G. Murzin, M. Levitt, S. R. Eddy,
and G. Mitchison for valuable discussion. S.E.B. was principally
supported by a St. John's College (Cambridge, UK) Benefactors'
Scholarship and by the American Friends of Cambridge University.
S.E.B. dedicates his contribution to the memory of Rabbi Albert T.
and Clara S. Bilgray.

1. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol. 215, 403-410.
2. Altschul, S. F. & Gish, W. (1996) Methods Enzymol. 266, 460-480.
3. Pearson, W. R. & Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448.
4. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995) J. Mol. Biol. 247, 536-540.
5. Brenner, S. E., Chothia, C., Hubbard, T. J. P. & Murzin, A. G. (1996) Methods Enzymol. 266, 635-643.
6. Pearson, W. R. (1991) Genomics 11, 635-650.
7. Pearson, W. R. (1995) Protein Sci. 4, 1145-1160.
8. Smith, T. F. & Waterman, M. S. (1981) J. Mol. Biol. 147, 195-197.
9. George, D. G., Hunt, L. T. & Barker, W. C. (1996) Methods Enzymol. 266, 41-59.
10. Vogt, G., Etzold, T. & Argos, P. (1995) J. Mol. Biol. 249, 816-831.
11. Henikoff, S. & Henikoff, J. G. (1993) Proteins 17, 49-61.
12. Bairoch, A. & Apweiler, R. (1996) Nucleic Acids Res. 24, 21-25.
13. Bairoch, A., Bucher, P. & Hofmann, K. (1996) Nucleic Acids Res. 24, 189-196.
14. Henikoff, S. & Henikoff, J. G. (1992) Proc. Natl. Acad. Sci. USA 89, 10915-10919.
15. Dayhoff, M., Schwartz, R. M. & Orcutt, B. C. (1978) in Atlas of Protein Sequence and Structure, ed. Dayhoff, M. (National Biomedical Research Foundation, Silver Spring, MD), Vol. 5, Suppl. 3, pp. 345-352.
16. Brenner, S. E. (1996) Ph.D. thesis (University of Cambridge, UK).
17. Sander, C. & Schneider, R. (1991) Proteins 9, 56-68.
18. Johnson, M. S. & Overington, J. P. (1993) J. Mol. Biol. 233, 716-738.
19. Barton, G. J. & Sternberg, M. J. E. (1987) Protein Eng. 1, 89-94.
20. Lesk, A. M., Levitt, M. & Chothia, C. (1986) Protein Eng. 1, 77-78.
21. Arratia, R., Gordon, L. & Waterman, M. (1986) Ann. Stat. 14, 971-993.
22. Karlin, S. & Altschul, S. F. (1990) Proc. Natl. Acad. Sci. USA 87, 2264-2268.
23. Karlin, S. & Altschul, S. F. (1993) Proc. Natl. Acad. Sci. USA 90, 5873-5877.
24. Altschul, S. F., Boguski, M. S., Gish, W. & Wootton, J. C. (1994) Nat. Genet. 6, 119-129.
25. Pearson, W. R. (1996) Methods Enzymol. 266, 227-258.
26. Lipman, D. J., Wilbur, W. J., Smith, T. F. & Waterman, M. S. (1984) Nucleic Acids Res. 12, 215-226.
27. Wootton, J. C. & Federhen, S. (1996) Methods Enzymol. 266, 554-571.
28. Waterman, M. S. & Vingron, M. (1994) Stat. Science 9, 367-381.
29. Perutz, M. F., Kendrew, J. C. & Watson, H. C. (1965) J. Mol. Biol. 13, 669-678.
30. Abola, E. E., Bernstein, F. C., Bryant, S. H., Koetzle, T. F. & Weng, J. (1987) in Crystallographic Databases: Information Content, Software Systems, Scientific Applications, eds. Allen, F. H., Bergerhoff, G. & Sievers, R. (Data Comm. Intl. Union Crystallogr., Cambridge, UK), pp. 107-132.
31. Brenner, S. E., Chothia, C. & Hubbard, T. J. P. (1997) Curr. Opin. Struct. Biol. 7, 369-376.
32. Orengo, C., Michie, A., Jones, S., Jones, D. T., Swindells, M. B. & Thornton, J. (1997) Structure (London) 5, 1093-1108.
33. Zweig, M. H. & Campbell, G. (1993) Clin. Chem. 39, 561-577.
34. Gribskov, M. & Robinson, N. L. (1996) Comput. Chem. 20, 25-33.
35. Fitch, W. M. (1966) J. Mol. Biol. 16, 9-16.
36. Chung, S. Y. & Subbiah, S. (1996) Structure (London) 4, 1123-1127.
37. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997) Nucleic Acids Res. 25, 3389-3402.
38. Girling, R., Schmidt, W., Jr., Houston, T., Amma, E. & Huisman, T. (1979) J. Mol. Biol. 131, 417-433.
39. Spezio, M., Wilson, D. & Karplus, P. (1993) Biochemistry 32, 9906-9916.
40. Sayle, R. A. & Milner-White, E. J. (1995) Trends Biochem. Sci. 20, 374-376.



Docket No.: PF-0162-3 DIV 
USSN: 09/784,739
Ref. No. 1 of 1



Proc. Natl. Acad. Sci. USA

Vol. 95, pp. 6073-6078, May 1998

Biochemistry

Assessing sequence comparison methods with reliable structurally
identified distant evolutionary relationships

Steven E. Brenner*†‡, Cyrus Chothia*, and Tim J. P. Hubbard§

*MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, United Kingdom; and §Sanger Centre, Wellcome Trust Genome Campus, Hinxton,
Cambs CB10 1SA, United Kingdom

Communicated by David R. Davies, National Institute of Diabetes, Bethesda, MD, March 16, 1998 (received for review November 12, 1997)



ABSTRACT Pairwise sequence comparison methods have
been assessed using proteins whose relationships are known
reliably from their structures and functions, as described in
the SCOP database [Murzin, A. G., Brenner, S. E., Hubbard, T.
& Chothia C. (1995) J. Mol. Biol. 247, 536-540]. The
evaluation tested the programs BLAST [Altschul, S. F., Gish, W.,
Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol.
215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996)
Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. &
Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448],
and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol.
Biol. 147, 195-197] and their scoring schemes. The error rate
of all algorithms is greatly reduced by using statistical scores
to evaluate matches rather than percentage identity or raw
scores. The E-value statistical scores of SSEARCH and FASTA are
reliable: the number of false positives found in our tests agrees
well with the scores reported. However, the P-values reported
by BLAST and WU-BLAST2 exaggerate significance by orders of
magnitude. SSEARCH, FASTA ktup = 1, and WU-BLAST2 perform
best, and they are capable of detecting almost all relationships
between proteins whose sequence identities are >30%. For
more distantly related proteins, they do much less well; only
one-half of the relationships between proteins with 20-30%
identity are found. Because many homologs have low sequence
similarity, most distant relationships cannot be detected by
any pairwise comparison method; however, those which are
identified may be used with confidence.

Sequence database searching plays a role in virtually every
branch of molecular biology and is crucial for interpreting the
sequences issuing forth from genome projects. Given the
method's central role, it is surprising that overall and relative
capabilities of different procedures are largely unknown. It is
difficult to verify algorithms on sample data because this
requires large data sets of proteins whose evolutionary
relationships are known unambiguously and independently of the
methods being evaluated. However, nearly all known
homologs have been identified by sequence analysis (the method
to be tested). Also, it is generally very difficult to know, in the
absence of structural data, whether two proteins that lack clear
sequence similarity are unrelated. This has meant that
although previous evaluations have helped improve sequence
comparison, they have suffered from insufficient, imperfectly
characterized, or artificial test data. Assessment also has been
problematic because high quality database sequence searching
attempts to have both sensitivity (detection of homologs) and
specificity (rejection of unrelated proteins); however, these
complementary goals are linked such that increasing one
causes the other to be reduced.

The publication costs of this article were defrayed in part by page charge 
payment. This article must therefore be hereby marked "advertisement'' in 
accordance with 18 U.S.C. §1734 solely to indicate this fact. 

© 1998 by The National Academy of Sciences 0027-8424/98/956073-6$ 2.00/0 
PNAS is available online at http://www.pnas.org. 



Sequence comparison methodologies have evolved rapidly,
so no previously published test has evaluated modern versions
of programs commonly used. For example, parameters in
BLAST (1) have changed, and WU-BLAST2 (2), which produces
gapped alignments, has become available. The latest version
of FASTA (3) previously tested was 1.6, but the current release
(version 3.0) provides fundamentally different results in the
form of statistical scoring.

The previous reports also have left gaps in our knowledge.
For example, there has been no published assessment of
thresholds for scoring schemes more sophisticated than
percentage identity. Thus, the widely discussed statistical scoring
measures have never actually been evaluated on large
databases of real proteins. Moreover, the different scoring schemes
commonly in use have not been compared.

Beyond these issues, there is a more fundamental question:
in an absolute sense, how well does pairwise sequence
comparison work? That is, what fraction of homologous proteins
can be detected using modern database searching methods?

In this work, we attempt to answer these questions and to
overcome both of the fundamental difficulties that have
hindered assessment of sequence comparison methodologies.
First, we use the set of distant evolutionary relationships in the
SCOP: Structural Classification of Proteins database (4), which
is derived from structural and functional characteristics (5).
The SCOP database provides a uniquely reliable set of
homologs, which are known independently of sequence
comparison. Second, we use an assessment method that jointly
measures both sensitivity and specificity. This method allows
straightforward comparison of different sequence searching
procedures. Further, it can be used to aid interpretation of real
database searches and thus provide optimal and reliable
results.

Previous Assessments of Sequence Comparison. Several
previous studies have examined the relative performance of
different sequence comparison methods. The most
encompassing analyses have been by Pearson (6, 7), who compared
the three most commonly used programs. Of these, the
Smith-Waterman algorithm (8) implemented in SSEARCH (3) is the
oldest and slowest but the most rigorous. Modern heuristics
have provided BLAST (1) the speed and convenience to make
it the most popular program. Intermediate between these two
is FASTA (3), which may be run in two modes offering either
greater speed (ktup = 2) or greater effectiveness (ktup = 1).
Pearson also considered different parameters for each of these
programs.

To test the methods, Pearson selected two representative
proteins from each of 67 protein superfamilies defined by the
PIR database (9). Each was used as a query to search the
database, and the matched proteins were marked as being
homologous or unrelated according to their membership of PIR

Abbreviation: EPQ, errors per query. 

†Present address: Department of Structural Biology, Stanford
University, Fairchild Building D-109, Stanford, CA 94305-5126.

‡To whom reprint requests should be addressed. e-mail:
brenner@hyper.stanford.edu.






superfamilies. Pearson found that modern matrices and
"ln-scaling" of raw scores improve results considerably. He also
reported that the rigorous Smith-Waterman algorithm worked
slightly better than FASTA, which was in turn more effective
than BLAST.

Very large scale analyses of matrices have been performed
(10), and Henikoff and Henikoff (11) also evaluated the
effectiveness of BLAST and FASTA. Their test with BLAST
considered the ability to detect homologs above a
predetermined score but had no penalty for methods which also
reported large numbers of spurious matches. The Henikoffs
searched the SWISS-PROT database (12) and used PROSITE (13)
to define homologous families. Their results showed that the
BLOSUM62 matrix (14) performed markedly better than the
extrapolated PAM-series matrices (15), which previously had
been popular.

A crucial aspect of any assessment is the data that are used
to test the ability of the program to find homologs. But in
Pearson's and the Henikoffs' evaluations of sequence
comparison, the correct results were effectively unknown. This is
because the superfamilies in PIR and PROSITE are principally
created by using the same sequence comparison methods
which are being evaluated. Interdependency of data and
methods creates a "chicken and egg" problem, and means, for
example, that new methods would be penalized for correctly
identifying homologs missed by older programs. For instance,
immunoglobulin variable and constant domains are clearly
homologous, but PIR places them in different superfamilies.
The problem is widespread: each superfamily in PIR 48.00 with
a structural homolog is itself homologous to an average of 1.6
other PIR superfamilies (16).

To surmount these sorts of difficulties, Sander and
Schneider (17) used protein structures to evaluate sequence
comparison. Rather than comparing different sequence
comparison algorithms, their work focused on determining a
length-dependent threshold of percentage identity, above which all
proteins would be of similar structure. A result of this analysis
was the HSSP equation; it states that proteins with 25% identity
over 80 residues will have similar structures, whereas shorter
alignments require higher identity. (Other studies also have
used structures (18-20), but these focused on a small number
of model proteins and were principally oriented toward
evaluating alignment accuracy rather than homology detection.)
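
The length dependence of the HSSP threshold can be made concrete with a small sketch. It assumes the form of the equation quoted in the Fig. 1 legend of this paper, H = 290.15 l^(-0.562) for alignment lengths between 10 and 80 residues, H = 24.7 above 80, and no attainable threshold below 10. The function names and the handling of the exact boundary lengths (10 and 80) are illustrative assumptions, not part of the original work.

```python
import math


def hssp_threshold(length: int) -> float:
    """Percentage identity threshold H for an alignment of `length` residues.

    Assumed form: H = 290.15 * l**-0.562 for 10 <= l <= 80, H = 24.7 for
    l > 80, and H > 100 (returned here as infinity) for l < 10.
    """
    if length < 10:
        return math.inf  # no percentage identity can exceed the threshold
    if length > 80:
        return 24.7
    return 290.15 * length ** -0.562


def hssp_adjusted_identity(percent_identity: float, length: int) -> float:
    """Percent identity within the alignment minus H (the HSSP-adjusted score)."""
    return percent_identity - hssp_threshold(length)
```

Under these assumptions the threshold falls smoothly to about 24.7% at 80 residues and is about 28% at 64 residues, the alignment length of the unrelated pair shown in Fig. 2.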

A general solution to the problem of scoring comes from
statistical measures (i.e., E-values and P-values) based on the
extreme value distribution (21). Extreme value scoring was
implemented analytically in the BLAST program using the
Karlin and Altschul statistics (22, 23), and empirical
approaches have been recently added to FASTA and SSEARCH. In
addition to being heralded as a reliable means of recognizing
significantly similar proteins (24, 25), the mathematical
tractability of statistical scores "is a crucial feature of the BLAST
algorithm" (1). The validity of this scoring procedure has been
tested analytically and empirically (see ref. 2 and references in
ref. 24). However, all large empirical tests used random
sequences that may lack the subtle structure found within
biological sequences (26, 27) and obviously do not contain any
real homologs. Thus, although many researchers have
suggested that statistical scores be used to rank matches (24, 25,
28), there have been no large rigorous experiments on
biological data to determine the degree to which such rankings are
superior.

A Database for Testing Homology Detection. Since the
discovery that the structures of hemoglobin and myoglobin are
very similar though their sequences are not (29), it has been
apparent that comparing structures is a more powerful (if less
convenient) way to recognize distant evolutionary
relationships than comparing sequences. If two proteins show a high
degree of similarity in their structural details and function, it



is very probable that they have an evolutionary relationship 
though their sequence similarity may be low. 

The recent growth of protein structure information
combined with the comprehensive evolutionary classification in
the SCOP database (4, 5) have allowed us to overcome previous
limitations. With these data, we can evaluate the performance
of sequence comparison methods on real protein sequences
whose relationships are known confidently. The SCOP database
uses structural information to recognize distant homologs, the
large majority of which can be determined unambiguously.
These superfamilies, such as the globins or the
immunoglobulins, would be recognized as related by the vast majority of the
biological community despite the lack of high sequence
similarity.

From SCOP, we extracted the sequences of domains of
proteins in the Protein Data Bank (PDB) (30) and created two
databases. One (PDB90D-B) has domains which were all <90%
identical to any other, whereas the other (PDB40D-B) had those <40%
identical. The databases were created by first sorting all
protein domains in SCOP by their quality and making a list. The
highest quality domain was selected for inclusion in the
database and removed from the list. Also removed from the list
(and discarded) were all other domains above the threshold
level of identity to the selected domain. This process was
repeated until the list was empty. The PDB40D-B database
contains 1,323 domains, which have 9,044 ordered pairs of
distant relationships, or ~0.5% of the total 1,749,006 ordered
pairs. In PDB90D-B, the 2,079 domains have 53,988
relationships, representing 1.2% of all pairs. Low complexity regions
of sequence can achieve spurious high scores, so these were
masked in both databases by processing with the SEG program
(27) using recommended parameters: 12 1.8 2.0. The databases
used in this paper are available from http://sss.stanford.edu/sss/,
and databases derived from the current version of SCOP
may be found at http://scop.mrc-lmb.cam.ac.uk/scop/.
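
The selection procedure described above is a greedy filter, and a minimal sketch may help make the steps explicit. The function names, the `identity` callback, and the toy identity measure below are hypothetical illustrations; the paper does not specify how quality or identity were computed in code.

```python
from typing import Callable, List, Sequence, TypeVar

D = TypeVar("D")


def select_representatives(
    domains_by_quality: Sequence[D],
    identity: Callable[[D, D], float],
    threshold: float,
) -> List[D]:
    """Greedy filter: keep the best remaining domain, discard its near-duplicates.

    `domains_by_quality` must already be sorted with the highest quality first.
    """
    remaining = list(domains_by_quality)
    selected: List[D] = []
    while remaining:
        best = remaining.pop(0)  # highest quality domain still on the list
        selected.append(best)
        # Discard every domain at or above the identity threshold to `best`.
        remaining = [d for d in remaining if identity(best, d) < threshold]
    return selected


def toy_identity(a: str, b: str) -> float:
    """Hypothetical stand-in: percent of matching positions over the longer length."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))


def ordered_pairs(n: int) -> int:
    """Number of ordered (query, target) pairs among n distinct sequences."""
    return n * (n - 1)
```

With 1,323 domains, `ordered_pairs` reproduces the 1,749,006 ordered pairs quoted for PDB40D-B.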

Analyses from both databases were generally consistent, but
PDB40D-B focuses on distantly related proteins and reduces the
heavy overrepresentation in the PDB of a small number of
families (31, 32), whereas PDB90D-B (with more sequences)
improves evaluations of statistics. Except where noted
otherwise, the distant homolog results here are from PDB40D-B.
Although the precise numbers reported here are specific to the
structural domain databases used, we expect the trends to be
general.

Assessment Data and Procedure. Our assessment of
sequence comparison may be divided into four different major
categories of tests. First, using just a single sequence
comparison algorithm at a time, we evaluated the effectiveness of
different scoring schemes. Second, we assessed the reliability
of scoring procedures, including an evaluation of the validity
of statistical scoring. Third, we compared sequence
comparison algorithms (using the optimal scoring scheme) to
determine their relative performance. Fourth, we examined the
distribution of homologs and considered the power of pairwise
sequence comparison to recognize them. All of the analyses
used the databases of structurally identified homologs and a
new assessment criterion.

The analyses tested BLAST (1), version 1.4.9MP, and
WU-BLAST2 (2), version 2.0a13MP. Also assessed was the FASTA
package, version 3.0t76 (3), which provided FASTA and the
SSEARCH implementation of Smith-Waterman (8). For
SSEARCH and FASTA, we used BLOSUM45 with gap penalties
-12/-1 (7, 16). The default parameters and matrix
(BLOSUM62) were used for BLAST and WU-BLAST2.

The "Coverage vs. Error" Plot. To test a particular protocol
(comprising a program and scoring scheme), each sequence
from the database was used as a query to search the database.
This yielded ordered pairs of query and target sequences with
associated scores, which were sorted, on the basis of their
scores, from best to worst. The ideal method would have



[Fig. 1 plots. (A) Smith-Waterman Scoring Schemes (PDB40D-B). (B) Smith-Waterman Scoring Schemes (PDB90D-B). x axis: coverage; y axis: errors per query (log scale).]



Fig. 1. Coverage vs. error plots of different scoring schemes for SSEARCH Smith-Waterman. (A) Analysis of PDB40D-B database. (B) Analysis
of PDB90D-B database. All of the proteins in the database were compared with each other using the SSEARCH program. The results of this single
set of comparisons were considered using five different scoring schemes and assessed. The graphs show the coverage and errors per query (EPQ)
for statistical scores, raw scores, and three measures using percentage identity. In the coverage vs. error plot, the x axis indicates the fraction of
all homologs in the database (known from structure) which have been detected. Precisely, it is the number of detected pairs of proteins with the
same fold divided by the total number of pairs from a common superfamily. PDB40D-B contains a total of 9,044 homologs, so a score of 10% indicates
identification of 904 relationships. The y axis reports the number of EPQ. Because there are 1,323 queries made in the PDB40D-B all-vs.-all
comparison, 13 errors corresponds to 0.01, or 1% EPQ. The y axis is presented on a log scale to show results over the widely varying degrees of
accuracy which may be desired. The scores that correspond to the levels of EPQ and coverage are shown in Fig. 4 and Table 1. The graph
demonstrates the trade-off between sensitivity and selectivity. As more homologs are found (moving to the right), more errors are made (moving
up). The ideal method would be in the lower right corner of the graph, which corresponds to identifying many evolutionary relationships without
selecting unrelated proteins. Three measures of percentage identity are plotted. Percentage identity within alignment is the degree of identity within
the aligned region of the proteins, without consideration of the alignment length. Percentage identity within both is the number of identical residues
in the aligned region as a percentage of the average length of the query and target proteins. The HSSP equation (17) is $H = 290.15\,l^{-0.562}$, where
$l$ is length for $10 < l < 80$; $H > 100$ for $l < 10$; $H = 24.7$ for $l > 80$. The percentage identity HSSP-adjusted score is the percent identity within
the alignment minus H. Smith-Waterman raw scores and E-values were taken directly from the sequence comparison program.



perfect separation, with all of the homologs at the top of the
list and unrelated proteins below. In practice, perfect
separation is impossible to achieve, so instead one is interested in
drawing a threshold above which there are the largest number
of related pairs of sequences consistent with an acceptable
error rate.

Our procedure involved measuring the coverage and error
for every threshold. Coverage was defined as the fraction of
structurally determined homologs that have scores above the
selected threshold; this reflects the sensitivity of a method.
Errors per query (EPQ), an indicator of selectivity, is the
number of nonhomologous pairs above the threshold divided
by the number of queries. Graphs of these data, called
coverage vs. error plots, were devised to understand how
protocols compare at different levels of accuracy. These
graphs share effectively all of the beneficial features of
Receiver Operating Characteristic (ROC) plots (33, 34) but
better represent the high degrees of accuracy required in
sequence comparison and the huge background of
nonhomologs.
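
The two definitions above translate directly into a short computation. The following is a minimal sketch, not the authors' software: the function names and the `(score, is_homolog)` input layout are assumptions, tied scores are not grouped, and `higher_is_better` would be set to False for E-values or P-values, where smaller means more significant.

```python
from typing import Iterable, List, Tuple


def coverage_vs_error(
    hits: Iterable[Tuple[float, bool]],
    total_homolog_pairs: int,
    num_queries: int,
    higher_is_better: bool = True,
) -> List[Tuple[float, float]]:
    """Return (coverage, EPQ) after each hit, walking the list from best to worst.

    Each hit is (score, is_homolog), where is_homolog comes from the
    structural classification, independently of the score being assessed.
    """
    ranked = sorted(hits, key=lambda h: h[0], reverse=higher_is_better)
    points: List[Tuple[float, float]] = []
    true_pairs = 0
    errors = 0
    for _score, is_homolog in ranked:
        if is_homolog:
            true_pairs += 1
        else:
            errors += 1
        points.append((true_pairs / total_homolog_pairs, errors / num_queries))
    return points


def coverage_at_epq(points: List[Tuple[float, float]], max_epq: float) -> float:
    """Largest coverage reached while errors per query stay at or below max_epq."""
    best = 0.0
    for coverage, epq in points:
        if epq <= max_epq:
            best = max(best, coverage)
    return best
```

For the PDB40D-B all-vs.-all comparison, `num_queries` would be 1,323 and `total_homolog_pairs` 9,044, so 13 errors corresponds to roughly 0.01 EPQ.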

This assessment procedure is directly relevant to practical
sequence database searching, for it provides precisely the
information necessary to perform a reliable sequence database
search. The EPQ measure places a premium on score
consistency; that is, it requires scores to be comparable for different
queries. Consistency is an aspect which has been largely

[Fig. 3 plot: Percent Identity of Unrelated Proteins (PDB90D-B). Each point plots the length and percent identity of an alignment between two unrelated proteins.]


Hemoglobin β-chain (1hdsb). Cellulase E2 (1tml_)

1hdsb GKVDVDWGAQALGR- - LLVVYPWTQRFFQHPGKLSSAGAVMNNPKVKAHGKRVLDAFTQGLKH
1tml_ GQVDALKSAAQAAGKIPILWYNAPGR- — DCGNHSSGGA PSHSAY-RSWIDEPAAGLKH 

Fig. 2. Unrelated proteins with high percentage identity.
Hemoglobin β-chain (PDB code 1hds chain b, ref. 38, Left) and cellulase E2
(PDB code 1tml, ref. 39, Right) have 39% identity over 64 residues, a
level which is often believed to be indicative of homology. Despite this
high degree of identity, their structures strongly suggest that these
proteins are not related. Appropriately, neither the raw alignment
score of 85 nor the E-value of 1.3 is significant. Proteins rendered by
RASMOL (40).




[Fig. 3 plot labels: HSSP threshold (line); x axis: alignment length.]



Fig. 3. Length and percentage identity of alignments of unrelated
proteins in PDB90D-B. Each pair of nonhomologous proteins found with
SSEARCH is plotted as a point whose position indicates the length and
the percentage identity within the alignment. Because alignment
length and percentage identity are quantized, many pairs of proteins
may have exactly the same alignment length and percentage identity.
The line shows the HSSP threshold (though it is intended to be applied
with a different matrix and parameters).






[Fig. 4 plot: Reliability of Statistical Scores (PDB90D-B). x axis: errors per query. Legible legend entries: SSEARCH E-values; FASTA ktup = 1 E-values.]



Fig. 4. Reliability of statistical scores in PDB90D-B. Each line shows
the relationship between reported statistical score and actual error
rate for a different program. E-values are reported for SSEARCH and
FASTA, whereas P-values are shown for BLAST and WU-BLAST2. If the
scoring were perfect, then the number of errors per query and the
E-values would be the same, as indicated by the upper bold line.
(P-values should be the same as EPQ for small numbers and diverge
at higher values, as indicated by the lower bold line.) E-values from
SSEARCH and FASTA are shown to have good agreement with EPQ but
underestimate the significance slightly. BLAST and WU-BLAST2 are
overconfident, with the degree of exaggeration dependent upon the
score. The results for PDB40D-B were similar to those for PDB90D-B
despite the difference in number of homologs detected. This graph
could be used to roughly calibrate the reliability of a given statistical
score.



ignored in previous tests but is essential for the straightforward
or automatic interpretation of sequence comparison results.
Further, it provides a clear indication of the confidence that
should be ascribed to each match. Indeed, the EPQ measure
should approximate the expectation value reported by
database searching programs, if the programs' estimates are
accurate.
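
The link between the two kinds of statistical score can be illustrated numerically. The sketch below assumes the standard Poisson relation P = 1 - exp(-E) between an expectation value E and the probability P of at least one chance match; the paper itself states only that P-values should match EPQ for small numbers and diverge at higher values. The function names are illustrative.

```python
import math


def p_value_from_e_value(e_value: float) -> float:
    """P = 1 - exp(-E): probability of one or more chance matches this good."""
    return -math.expm1(-e_value)


def e_value_from_p_value(p_value: float) -> float:
    """Inverse relation, E = -ln(1 - P), valid for 0 <= P < 1."""
    return -math.log1p(-p_value)
```

Under this assumption, an E-value of 0.001 and its P-value agree to within about one part in two thousand, whereas an E-value of 1 corresponds to a P-value of about 0.63, which is why the two kinds of score can be compared directly with EPQ only at small values.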

The Performance of Scoring Schemes. All of the programs 
tested could provide three fundamental types of scores. The 
first score is the percentage identity, which may be computed 
in several ways based on either the length of the alignment or 
the lengths of the sequences. The second is a "raw" or 
"Smith-Waterman" score, which is the measure optimized by 
the Smith-Waterman algorithm and is computed by summing 
the substitution ■maiti'ix scores' for each position- in the align- 
ment and subtracting gap penalties. In BLAST, a measure' 

Sequence Comparison Algorithms (PDB40D-B) , 



0.01 



0.001 




related to this score is scaled into bits. Third is a statistical 
score based on the extreme value distribution. These results 
are summarized in Fig. 1.
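
The raw score described above can be sketched in a few lines of Python. This is an illustration only: the function names, the toy substitution values, and the affine gap convention (a larger penalty for the first gapped position, a smaller one for each further position) are assumptions rather than the settings used in the tests. The bit-score conversion uses the standard Karlin-Altschul form, with `lam` and `k` as caller-supplied parameters.

```python
import math
from typing import Dict, Tuple


def raw_score(aligned_a: str, aligned_b: str,
              matrix: Dict[Tuple[str, str], int],
              gap_open: int, gap_extend: int) -> int:
    """Sum substitution scores over aligned columns and subtract gap penalties.

    '-' marks a gap. The first gapped column of a run costs gap_open and
    every following gapped column costs gap_extend (assumed convention).
    Adjacent gaps in different sequences are treated as one run here.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            score -= gap_extend if in_gap else gap_open
            in_gap = True
        else:
            # The matrix is symmetric; only one orientation is stored.
            score += matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]
            in_gap = False
    return score


def bit_score(raw: float, lam: float, k: float) -> float:
    """Scale a raw score into bits: (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)


# Toy symmetric substitution values, for illustration only.
toy_matrix = {("A", "A"): 4, ("G", "G"): 6, ("W", "W"): 11,
              ("A", "G"): 0, ("A", "W"): -3, ("G", "W"): -2}

one_gap = raw_score("AGW-A", "AGWGA", toy_matrix, gap_open=12, gap_extend=2)
two_gaps = raw_score("AG--A", "AGWGA", toy_matrix, gap_open=12, gap_extend=2)
```

Because the raw score depends on the matrix and gap parameters, the same alignment yields different raw scores under different settings; scaling by `lam` and `k` is what makes bit scores comparable across scoring systems.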

Sequence Identity. Though it has been long established that 
percentage identity is a poor measure (35), there is a common 
rule-of-thumb stating that 30% identity signifies homology. 
Moreover, publications have indicated that 25% identity can 
be used as a threshold (17, 36). We find that these thresholds,
originally derived years ago, are not supported by present
results. As databases have grown, so have the possibilities for
chance alignments with high identity; thus, the reported cutoffs
lead to frequent errors. Fig. 2 shows one of the many pairs of
proteins with very different structures that nonetheless have
high levels of identity over considerable aligned regions.
Despite the high identity, the raw and the statistical scores for
such incorrect matches are typically not significant. The principal
reasons percentage identity does so poorly seem to be
that it ignores information about gaps and about the conservative
or radical nature of residue substitutions.

From the PDB90D-B analysis in Fig. 3, we learn that 30%
identity is a reliable threshold for this database only for
sequence alignments of at least 150 residues. Because one
unrelated pair of proteins has 43.5% identity over 62 residues,
it is probably necessary for alignments to be at least 70 residues
in length before 40% is a reasonable threshold, for a database
of this particular size and composition.

At a given reliability, scores based on percentage identity 
detect just a fraction of the distant homologs found by
statistical scoring. If one measures the percentage identity in
the aligned regions without consideration of alignment length,
then a negligible number of distant homologs are detected.
Use of the HSSP equation improves the value of percentage
identity, but even this measure can find only 4% of all known
homologs at 1% EPQ. In short, percentage identity discards
most of the information measured in a sequence comparison.
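
A length-dependent identity threshold of the HSSP kind can be sketched as follows. The curve coefficients below are the commonly quoted form of the Sander and Schneider threshold (17) and are an assumption here, since the equation itself is not reproduced in this text; the 9.8-point offset is the "HSSP + 9.8" cutoff listed in Table 1. Function names are hypothetical.

```python
def hssp_threshold(aln_len: int) -> float:
    """Percentage identity threshold as a function of alignment length.

    Assumed form: 290.15 * L**-0.562 for 10 <= L <= 80, flat at 24.8% above 80.
    """
    if aln_len < 10:
        raise ValueError("curve is not defined for alignments shorter than 10")
    if aln_len > 80:
        return 24.8
    return 290.15 * aln_len ** -0.562


def passes_hssp(pct_identity: float, aln_len: int, offset: float = 9.8) -> bool:
    """True if identity clears the length-dependent curve plus an offset."""
    return pct_identity >= hssp_threshold(aln_len) + offset


# Short alignments need far higher identity than long ones.
short_needed = hssp_threshold(20) + 9.8
long_needed = hssp_threshold(200) + 9.8
```

For long alignments the offset curve settles near 35% identity, which matches the cutoff reported in Table 1; for short alignments the required identity rises steeply, which is the behavior a fixed 25% or 30% rule ignores.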

Raw Scores. Smith-Waterman raw scores perform better 
than percentage identity (Fig. 1), but ln-scaling (7) provided no
notable benefit in our analysis. It is necessary to be very precise
when using either raw or bit scores because a 20% change in
cutoff score could yield a tenfold difference in EPQ. However,
it is difficult to choose appropriate thresholds because the
reliability of a bit score depends on the lengths of the proteins
matched and the size of the database. Raw score thresholds
also are affected by matrix and gap parameters.

Statistical Scores. Statistical scores were introduced partly 
to overcome the problems that arise from raw scores. This
scoring scheme provides the best discrimination between
homologous proteins and those which are unrelated. Most 

[Fig. 5B plot: Sequence Comparison Algorithms (PDB90D-B); horizontal axis: Coverage]




Fig. 5. Coverage vs. error plots of different sequence comparison methods. Five different sequence comparison methods are evaluated, each
using statistical scores (E- or P-values). (A) PDB40D-B database. In this analysis, the best method is the slow SSEARCH, which finds 18% of relationships
at 1% EPQ. FASTA ktup = 1 and WU-BLAST2 are almost as good. (B) PDB90D-B database. The quick WU-BLAST2 program provides the best coverage
at 1% EPQ on this database, although at higher levels of error it becomes slightly worse than FASTA ktup = 1 and SSEARCH.



Biochemistry: Brenner et al.

Proc. Natl. Acad. Sci. USA 95 (1998) 6077



likely, its power can be attributed to its incorporation of more 
information than any other measure; it takes account of the 
full substitution and gap data (like raw scores) but also has 
details about the sequence lengths and composition and is 
scaled appropriately.
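
To show how a statistical score folds sequence lengths into the raw score, here is a minimal Python sketch using the Karlin-Altschul expectation E = K m n exp(-lambda S). This is the form associated with BLAST-style statistics; the programs tested may estimate their parameters differently. The values of `LAM` and `K` are placeholders chosen for illustration, and the function name is hypothetical.

```python
import math

# Placeholder parameters; real values depend on the matrix and gap penalties.
LAM = 0.267
K = 0.041


def expected_hits(raw: float, query_len: int, db_len: int,
                  lam: float = LAM, k: float = K) -> float:
    """Expected number of chance alignments scoring at least `raw`."""
    return k * query_len * db_len * math.exp(-lam * raw)


# The same raw score is less significant in a database twice as large.
e_small_db = expected_hits(60, query_len=100, db_len=1_000_000)
e_large_db = expected_hits(60, query_len=100, db_len=2_000_000)

# Raising the score by ln(10)/lambda lowers the expectation tenfold.
e_higher_score = expected_hits(60 + math.log(10) / LAM, 100, 1_000_000)
```

The exponential dependence on the score is consistent with the observation that a modest change in a raw score cutoff can change the error rate by an order of magnitude.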

We find that statistical scores are not only powerful, but also
easy to interpret. SSEARCH and FASTA show close agreement
between statistical scores and actual number of errors per
query (Fig. 4). The expectation value score gives a good,
slightly conservative estimate of the chances of the two sequences
being found at random in a given query. Thus, an
E-value of 0.01 indicates that roughly one pair of nonhomologs
of this similarity should be found in every 100 different queries.
Neither raw scores nor percentage identity can be interpreted
in this way, and these results validate the suitability of the
extreme value distribution for describing the scores from a
database search.
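
The interpretation of an E-value as an expected error count can be made concrete with a short Python sketch. The function names are hypothetical; the 0.03 cutoff is the SSEARCH E-value at 1% EPQ listed in Table 1 for PDB40D-B.

```python
def expected_errors(e_value: float, n_queries: int) -> float:
    """Expected number of chance matches at this E-value over n_queries searches."""
    return e_value * n_queries


def conservatism(reported_cutoff: float, observed_epq: float) -> float:
    """Ratio of reported score to observed EPQ.

    Above 1 the score understates significance (conservative);
    below 1 it overstates significance (overconfident).
    """
    return reported_cutoff / observed_epq


# An E-value of 0.01 corresponds to about one false match per 100 queries.
per_hundred = expected_errors(0.01, 100)

# Table 1: an SSEARCH E-value cutoff of 0.03 yields 1% EPQ on PDB40D-B.
ssearch_ratio = conservatism(0.03, 0.01)
```

A ratio of about 3 is what "slightly conservative" looks like in practice: matches reported at E = 0.03 are in fact wrong only about once per 100 queries.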

The P-values from BLAST also should be directly interpretable
but were found to overstate significance by more than two
orders of magnitude for 1% EPQ for this database. Nonetheless,
these results strongly suggest that the analytic theory is
fundamentally appropriate. WU-BLAST2 scores were more reliable
than those from BLAST, but also exaggerate expected
confidence by more than an order of magnitude at 1% EPQ.
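
P-values and E-values can be compared directly because, under the usual Poisson assumption for chance matches, P = 1 - exp(-E). The following Python sketch shows the conversion in both directions; the function names are hypothetical.

```python
import math


def p_from_e(e_value: float) -> float:
    """Probability of at least one chance match, given the expected count."""
    return -math.expm1(-e_value)


def e_from_p(p_value: float) -> float:
    """Expected count of chance matches implied by a P-value (p_value < 1)."""
    return -math.log1p(-p_value)


# For small values the two are nearly identical ...
small_p = p_from_e(0.01)
# ... but P saturates toward 1 while E keeps growing.
large_p = p_from_e(5.0)
```

This is why a P-value and an EPQ can be read on the same scale near 1% EPQ, while the two diverge at higher values.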

Overall Detection of Homologs and Comparison of Algorithms.
The results in Fig. 5A and Table 1 show that pairwise
sequence comparison is capable of identifying only a small
fraction of the homologous pairs of sequences in PDB40D-B.
Even SSEARCH with E-values, the best protocol tested, could
find only 18% of all relationships at a 1% EPQ. BLAST, which
identifies 15%, was the worst performer, whereas FASTA
ktup = 1 is nearly as effective as SSEARCH. FASTA ktup = 2 and
WU-BLAST2 are intermediate in their ability to detect homologs.
Comparison of different algorithms indicates that
those capable of identifying more homologs are generally
slower. SSEARCH is 25 times slower than BLAST and 6.5 times
slower than FASTA ktup = 1. WU-BLAST2 is slightly faster than
FASTA ktup = 2, but the latter has more interpretable scores.
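
The speed and coverage comparison can be reproduced from the Table 1 figures with a few lines of Python. The numbers are copied from Table 1 (PDB40D-B); the dictionary layout and function name are hypothetical.

```python
# method: (relative time, coverage in percent at 1% EPQ), from Table 1
TABLE1 = {
    "SSEARCH E-values": (25.5, 18.4),
    "FASTA ktup = 1 E-values": (3.9, 17.9),
    "FASTA ktup = 2 E-values": (1.4, 16.7),
    "WU-BLAST2 P-values": (1.1, 17.5),
    "BLAST P-values": (1.0, 14.8),
}


def slowdown(method: str, baseline: str) -> float:
    """How many times slower `method` is than `baseline`."""
    return TABLE1[method][0] / TABLE1[baseline][0]


# Methods ordered from highest to lowest coverage.
by_coverage = sorted(TABLE1, key=lambda m: TABLE1[m][1], reverse=True)

ssearch_vs_blast = slowdown("SSEARCH E-values", "BLAST P-values")
ssearch_vs_fasta1 = slowdown("SSEARCH E-values", "FASTA ktup = 1 E-values")
```

The ratios come out at roughly 25 and 6.5, matching the figures quoted in the text, while the spread in coverage between the best and worst statistical methods is under four percentage points.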

In PDB90D-B, where there are many close relationships, the 
best method can identify only 38% of structurally known 
homologs (Fig. 5B). The method which finds that many
relationships is WU-BLAST2. Consequently, we infer that the
differences between FASTA ktup = 1, SSEARCH, and WU-BLAST2
programs are unlikely to be significant when compared with 
variation in database composition and scoring reliability. 

Fig. 6 helps to explain why most distant homologs cannot be 
found by sequence comparison: a great many such relation- 
ships have no more sequence identity than would be expected 
by chance. SSEARCH with E-values can recognize >90% of the
homologous pairs with 30-40% identity. In this region, there
are 30 pairs of homologous proteins that do not have significant
E-values, but 26 of these involve sequences with <50
residues. Of sequences having 25-30% identity, 75% are
identified by SSEARCH E-values. However, although the number
of homologs grows at lower levels of identity, the detection
falls off sharply: only 40% of homologs with 20-25% identity 



[Fig. 6 plot: Distribution and Detection of Homologs (PDB40D-B). Vertical axis ticks: 500 to 2500. Horizontal axis: Percentage Identity: In both, 0 to 40. Legend: total number of homologs in database; homologs detected by SSEARCH E-values at 1% EPQ.]

Fig. 6. Distribution and detection of homologs in PDB40D-B. Bars
show the distribution of homologous pairs in PDB40D-B according to their
identity (using the measure of identity in both). Filled regions indicate
the number of these pairs found by the best database searching method
(SSEARCH with E-values) at 1% EPQ. The PDB40D-B database contains
proteins with <40% identity, and as shown on this graph, most
structurally identified homologs in the database have diverged
extremely far in sequence and have <20% identity. Note that the
alignments may be inaccurate, especially at low levels of identity. Filled
regions show that SSEARCH can identify most relationships that have
25% or more identity, but its detection wanes sharply below 25%.
Consequently, the great sequence divergence of most structurally
identified evolutionary relationships effectively defeats the ability of
pairwise sequence comparison to detect them.

are detected and only 10% of those with 15-20% can be found. 
These results show that statistical scores can find related 
proteins whose identity is remarkably low; however, the power 
of the method is restricted by the great divergence of many
protein sequences. 

After completion of this work, a new version of pairwise 
BLAST was released: BLASTGP (37). It supports gapped alignments,
like WU-BLAST2, and dispenses with sum statistics. Our
initial tests on BLASTGP using default parameters show that its
E-values are reliable and that its overall detection of homologs 
was substantially better than that of ungapped BLAST, but not 
quite equal to that of WU-BLAST2. 

CONCLUSION 

The general consensus amongst experts (see refs. 7, 24, 25, 27 
and references therein) suggests that the most effective se- 
quence searches are made by (i) using a large current database 
in which the protein sequences have been complexity masked 
and (ii) using statistical scores to interpret the results. Our
experiments fully support this view. 

Our results also suggest two further points. First, the E-values
reported by FASTA and SSEARCH give fairly accurate
estimates of the significance of each match, but the P-values
provided by BLAST and WU-BLAST2 underestimate the true



Table 1. Summary of sequence comparison methods with PDB40D-B

Method                                 Relative Time*   1% EPQ Cutoff       Coverage at 1% EPQ (%)
SSEARCH % identity: within alignment   25.5             >70%                <0.1
SSEARCH % identity: within both        25.5             34%                 3.0
SSEARCH % identity: HSSP-scaled        25.5             35% (HSSP + 9.8)    4.0
SSEARCH Smith-Waterman raw scores      25.5             142                 10.5
SSEARCH E-values                       25.5             0.03                18.4
FASTA ktup = 1 E-values                3.9              0.03                17.9
FASTA ktup = 2 E-values                1.4              0.03                16.7
WU-BLAST2 P-values                     1.1              0.003               17.5
BLAST P-values                         1.0              0.00016             14.8

*Times are from large database searches with genome proteins.






extent of errors. Second, SSEARCH, WU-BLAST2, and FASTA
ktup = 1 perform best, though BLAST and FASTA ktup = 2
detect most of the relationships found by the best procedures
and are appropriate for rapid initial searches.

The homologous proteins that are found by sequence comparison
can be distinguished with high reliability from the huge
number of unrelated pairs. However, even the best database
searching procedures tested fail to find the large majority of
distant evolutionary relationships at an acceptable error rate.
Thus, if the procedures assessed here fail to find a reliable
match, it does not imply that the sequence is unique; rather, it
indicates that any relatives it might have are distant ones.**

**Additional and updated information about this work, including
supplementary figures, may be found at http://sss.stanford.edu/sss/.



The authors are grateful to Drs. A. G. Murzin, M. Levitt, S. R. Eddy,
and G. Mitchison for valuable discussion. S.E.B. was principally
supported by a St. John's College (Cambridge, UK) Benefactors'
Scholarship and by the American Friends of Cambridge University.
S.E.B. dedicates his contribution to the memory of Rabbi Albert T.
and Clara S. Bilgray.

1. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman,
D. J. (1990) J. Mol. Biol. 215, 403-410.

2. Altschul, S. F. & Gish, W. (1996) Methods Enzymol. 266, 460-480.

3. Pearson, W. R. & Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA
85, 2444-2448.

4. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995)
J. Mol. Biol. 247, 536-540.

5. Brenner, S. E., Chothia, C., Hubbard, T. J. P. & Murzin, A. G.
(1996) Methods Enzymol. 266, 635-643.

6. Pearson, W. R. (1991) Genomics 11, 635-650.

7. Pearson, W. R. (1995) Protein Sci. 4, 1145-1160.

8. Smith, T. F. & Waterman, M. S. (1981) J. Mol. Biol. 147, 195-197.

9. George, D. G., Hunt, L. T. & Barker, W. C. (1996) Methods
Enzymol. 266, 41-59.

10. Vogt, G., Etzold, T. & Argos, P. (1995) J. Mol. Biol. 249, 816-831.

11. Henikoff, S. & Henikoff, J. G. (1993) Proteins 17, 49-61.

12. Bairoch, A. & Apweiler, R. (1996) Nucleic Acids Res. 24, 21-25.

13. Bairoch, A., Bucher, P. & Hofmann, K. (1996) Nucleic Acids Res.
24, 189-196.

14. Henikoff, S. & Henikoff, J. G. (1992) Proc. Natl. Acad. Sci. USA
89, 10915-10919.

15. Dayhoff, M., Schwartz, R. M. & Orcutt, B. C. (1978) in Atlas of
Protein Sequence and Structure, ed. Dayhoff, M. (National Biomedical
Research Foundation, Silver Spring, MD), Vol. 5, Suppl.
3, pp. 345-352.

16. Brenner, S. E. (1996) Ph.D. thesis (University of Cambridge,
UK).

17. Sander, C. & Schneider, R. (1991) Proteins 9, 56-68.

18. Johnson, M. S. & Overington, J. P. (1993) J. Mol. Biol. 233,
716-738.

19. Barton, G. J. & Sternberg, M. J. E. (1987) Protein Eng. 1, 89-94.

20. Lesk, A. M., Levitt, M. & Chothia, C. (1986) Protein Eng. 1,
77-78.

21. Arratia, R., Gordon, L. & Waterman, M. S. (1986) Ann. Stat. 14, 971-993.

22. Karlin, S. & Altschul, S. F. (1990) Proc. Natl. Acad. Sci. USA 87,
2264-2268.

23. Karlin, S. & Altschul, S. F. (1993) Proc. Natl. Acad. Sci. USA 90,
5873-5877.

24. Altschul, S. F., Boguski, M. S., Gish, W. & Wootton, J. C. (1994)
Nat. Genet. 6, 119-129.

25. Pearson, W. R. (1996) Methods Enzymol. 266, 227-258.

26. Lipman, D. J., Wilbur, W. J., Smith, T. F. & Waterman, M. S.
(1984) Nucleic Acids Res. 12, 215-226.

27. Wootton, J. C. & Federhen, S. (1996) Methods Enzymol. 266,
554-571.

28. Waterman, M. S. & Vingron, M. (1994) Stat. Science 9, 367-381.

29. Perutz, M. F., Kendrew, J. C. & Watson, H. C. (1965) J. Mol. Biol.
13, 669-678.

30. Abola, E. E., Bernstein, F. C., Bryant, S. H., Koetzle, T. F. &
Weng, J. (1987) in Crystallographic Databases: Information Content,
Software Systems, Scientific Applications, eds. Allen, F. H.,
Bergerhoff, G. & Sievers, R. (Data Comm. Intl. Union Crystallogr.,
Cambridge, UK), pp. 107-132.

31. Brenner, S. E., Chothia, C. & Hubbard, T. J. P. (1997) Curr. Opin.
Struct. Biol. 7, 369-376.

32. Orengo, C., Michie, A., Jones, S., Jones, D. T., Swindells, M. B. &
Thornton, J. (1997) Structure (London) 5, 1093-1108.

33. Zweig, M. H. & Campbell, G. (1993) Clin. Chem. 39, 561-577.

34. Gribskov, M. & Robinson, N. L. (1996) Comput. Chem. 20, 25-33.

35. Fitch, W. M. (1966) J. Mol. Biol. 16, 9-16.

36. Chung, S. Y. & Subbiah, S. (1996) Structure (London) 4, 1123-1127.

37. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang,
Z., Miller, W. & Lipman, D. J. (1997) Nucleic Acids Res. 25,
3389-3402.

38. Girling, R., Schmidt, W., Jr., Houston, T., Amma, E. & Huisman,
T. (1979) J. Mol. Biol. 131, 417-433.

39. Spezio, M., Wilson, D. & Karplus, P. (1993) Biochemistry 32,
9906-9916.

40. Sayle, R. A. & Milner-White, E. J. (1995) Trends Biochem. Sci.
20, 374-376.



